Paperid: 1, https://arxiv.org/pdf/2501.19406.pdf
Authors:Matthew Chen, Joshua Engels, Max Tegmark
Title: Low-Rank Adapting Models for Sparse Autoencoders
Abstract:
Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we improve on these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3× to 20× faster on Gemma and 2× to 10× faster on Llama. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once without harming general language model capabilities. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
Chinese: 本研究采用低秩自适应(LoRA)技术对预训练稀疏自编码器(SAE)周围的语言模型进行微调,将交叉熵损失差距降低30%至55%,相比传统方法实现3到20倍的加速收敛,同时保持模型性能不受影响。
English: This work uses low-rank adaptation (LoRA) to fine-tune language models around previously trained sparse autoencoders (SAEs), reducing the cross-entropy loss gap by 30% to 55% and reaching the same downstream cross-entropy loss 3× to 20× faster than end-to-end SAEs on Gemma, without harming general language model capabilities.
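
The low-rank adaptation idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, in which a frozen weight matrix W is adapted as W + BA with B initialized to zero; the matrices and helper names below are illustrative and are not taken from the authors' code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a):
    """Return the adapted weight W + B @ A; W stays frozen, only B and A would be trained."""
    delta = matmul(b, a)
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Illustrative values only: a frozen 3x3 weight and rank-1 factors.
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
a = [[0.5, -0.5, 1.0]]              # A has shape (rank, in_dim) = (1, 3)
b_init = [[0.0], [0.0], [0.0]]      # B starts at zero, so the model is unchanged at first
b_trained = [[1.0], [0.0], [2.0]]   # hypothetical values after some training

unchanged = lora_weight(w, b_init, a)
adapted = lora_weight(w, b_trained, a)
```

With rank 1, the update has 6 trainable numbers instead of the 9 entries of the full matrix; the gap grows quickly for realistic layer sizes.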

Authors:Andrey Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Ilya Zisman, Denis Tarasov, Alexander Nikulin, Vladislav Kurenkov
Title: Vintix: Action Model via In-Context Reinforcement Learning
Abstract:
In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems. Code to be released at https://github.com/dunnolab/vintix
中文摘要:情境强化学习(ICRL)通过推理阶段的试错学习展现出开发通用智能体的潜力,其中算法蒸馏框架成为构建多功能行动模型时可与专家蒸馏相竞争的替代方案。
English Summary: In-Context Reinforcement Learning (ICRL) shows potential for developing generalist agents through trial-and-error learning at inference time, with Algorithm Distillation emerging as a competitive alternative to expert distillation for creating versatile action models.

Authors:Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
Title: s1: Simple test-time scaling
Abstract:
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
中文摘要:本研究通过构建精选数据集和预算强制技术,提出了一种简单的测试时扩展方法,使Qwen2.5-32B模型在数学推理任务上超越OpenAI的o1-preview,同时保持完全开源。
English Summary: This study introduces a simple test-time scaling method that combines a curated 1,000-example dataset (s1K) with budget forcing; the resulting s1-32B model, fine-tuned from Qwen2.5-32B-Instruct, exceeds OpenAI's o1-preview on competition math questions by up to 27% while being fully open-source.
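
The budget forcing procedure described in this abstract can be sketched as a small control loop. This is a minimal sketch under stated assumptions: `generate` is a hypothetical callable standing in for the language model, `END_THINK` is a made-up end-of-thinking delimiter, and tokens are plain strings; none of these names come from the s1 code base.

```python
END_THINK = "<end_think>"  # hypothetical end-of-thinking delimiter


def budget_force(generate, prompt, num_waits, max_tokens):
    """Extend thinking by appending "Wait" up to num_waits times, and cut it off at max_tokens."""
    trace, waits_left = [], num_waits
    while len(trace) < max_tokens:
        chunk = generate(prompt, trace, max_tokens - len(trace))
        ended = bool(chunk) and chunk[-1] == END_THINK
        if ended:
            chunk = chunk[:-1]  # drop the delimiter; we decide when thinking ends
        trace.extend(chunk)
        if ended:
            if waits_left > 0 and len(trace) < max_tokens:
                trace.append("Wait")  # suppress the stop and nudge the model to continue
                waits_left -= 1
            else:
                break
        elif not chunk:
            break  # the model produced nothing; avoid looping forever
    # Forceful termination: never exceed the token budget.
    return trace[:max_tokens] + [END_THINK]


def toy_model(prompt, trace, budget):
    """Stand-in model: emits two thinking tokens, then tries to stop."""
    step = trace.count("Wait")
    return [f"think{step}a", f"think{step}b", END_THINK][:budget]


lengthened = budget_force(toy_model, "2+2?", num_waits=2, max_tokens=100)
truncated = budget_force(toy_model, "2+2?", num_waits=5, max_tokens=4)
```

The first call shows the lengthening case (two "Wait" tokens are appended before the model is allowed to stop), and the second shows the forced cut-off at four thinking tokens.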

Authors:Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton
Title: Federated Sketching LoRA: On-Device Collaborative Fine-Tuning of Large Language Models
Abstract:
Fine-tuning large language models (LLMs) on devices remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with device model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying device capabilities constrain LoRA's feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable devices to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the devices, FSLoRA flexibly adapts to device-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through comprehensive experiments on multiple datasets and LLM models, we demonstrate FSLoRA's performance improvements compared to various baselines. The code is available at https://github.com/wenzhifang/Federated-Sketching-LoRA-Implementation.
中文: 在资源受限的客户端微调大语言模型具有挑战性,但FSLoRA通过草图机制使客户端能选择性更新全局LoRA模块的子矩阵,灵活适应不同客户端的通信与计算限制,并提供了严格的收敛性分析和实验性能提升验证。
English: Fine-tuning large language models on resource-limited clients is challenging, but FSLoRA introduces a sketching mechanism that allows clients to selectively update submatrices of global LoRA modules, adapting to their specific constraints while providing rigorous convergence analysis and demonstrating performance gains in experiments.
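
The sketching idea in this abstract, where each device works on a submatrix whose rank is set by its sketching ratio, can be illustrated as follows. This is a rough sketch only: the uniform random choice of rank indices and the function names are assumptions made for illustration, and the paper's actual sketching mechanism may differ.

```python
import random


def sample_rank_indices(rank, ratio, rng):
    """Pick which rank components a device will update; the ratio sets how many."""
    k = max(1, round(ratio * rank))
    return sorted(rng.sample(range(rank), k))


def extract_submodules(b, a, idx):
    """Slice the global LoRA factors B (d x r) and A (r x n) down to the chosen rank indices."""
    b_sub = [[row[j] for j in idx] for row in b]  # d x k: selected columns of B
    a_sub = [a[j] for j in idx]                   # k x n: selected rows of A
    return b_sub, a_sub


# Illustrative global modules with rank 8, output dim 4, input dim 3.
rank, d, n = 8, 4, 3
global_b = [[float(i * rank + j) for j in range(rank)] for i in range(d)]
global_a = [[float(j * n + c) for c in range(n)] for j in range(rank)]

rng = random.Random(0)
idx = sample_rank_indices(rank, ratio=0.25, rng=rng)  # a weaker device keeps 2 of 8 ranks
b_sub, a_sub = extract_submodules(global_b, global_a, idx)
```

A device with a larger ratio would simply receive more columns of B and rows of A, which is how the ratio trades accuracy against communication and compute in this simplified picture.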

Authors:Xingyou Song, Dara Bahri
Title: Decoding-based Regression
Abstract:
Language models have recently been shown capable of performing regression wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal sequence decoding models as numeric regression heads given any feature representation. We find that, despite being trained in the usual way - for next-token prediction via cross-entropy loss - decoder-based heads are as performant as standard pointwise heads when benchmarked over standard regression tasks, while being flexible enough to capture smooth numeric distributions, such as in the task of density estimation.
中文摘要:语言模型能够通过将数值预测解码为字符串来有效执行回归任务,其中基于解码器的回归头在性能上与标准方法相当,同时能更好地捕捉平滑数值分布,如密度估计。
English Summary: Language models can effectively perform numeric regression by decoding predictions as strings, with decoder-based heads matching the performance of standard regression methods while offering greater flexibility for tasks like density estimation.
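
To make "numeric predictions represented as decoded strings" concrete, here is a minimal sketch of one possible number-to-token mapping. The fixed-point base-10 scheme below is an assumption chosen for simplicity; it is not claimed to be the tokenization used in the paper.

```python
def encode(y, digits=4):
    """Map y in [0, 1) to a fixed-length sequence of base-10 digit tokens."""
    if not 0.0 <= y < 1.0:
        raise ValueError("y must lie in [0, 1)")
    scaled = int(y * 10 ** digits)  # truncates to the given number of digits
    return [int(c) for c in f"{scaled:0{digits}d}"]


def decode(tokens):
    """Turn a digit-token sequence back into a number in [0, 1)."""
    return sum(t * 10 ** -(i + 1) for i, t in enumerate(tokens))


tokens = encode(0.8125)
value = decode(tokens)
```

A decoder head trained with cross-entropy would predict these tokens one at a time, so its output is a distribution over digit sequences rather than a single point estimate.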

Authors:Liudi Yang, Ruben Mascaro, Ignacio Alzugaray, Sai Manoj Prakhya, Marco Karrer, Ziyuan Liu, Margarita Chli
Title: LiDAR Loop Closure Detection using Semantic Graphs with Graph Attention Networks
Abstract:
In this paper, we propose a novel loop closure detection algorithm that uses graph attention neural networks to encode semantic graphs to perform place recognition and then uses semantic registration to estimate the 6 DoF relative pose constraint. Our place recognition algorithm has two key modules, namely, a semantic graph encoder module and a graph comparison module. The semantic graph encoder employs graph attention networks to efficiently encode spatial, semantic and geometric information from the semantic graph of the input point cloud. We then use a self-attention mechanism in both node-embedding and graph-embedding steps to create distinctive graph vectors. The graph vectors of the current scan and a keyframe scan are then compared in the graph comparison module to identify a possible loop closure. Specifically, employing the difference of the two graph vectors showed a significant improvement in performance, as shown in ablation studies. Lastly, we implemented a semantic registration algorithm that takes in loop closure candidate scans and estimates the relative 6 DoF pose constraint for the LiDAR SLAM system. Extensive evaluation on public datasets shows that our model is more accurate and robust, achieving 13% improvement in maximum F1 score on the SemanticKITTI dataset, when compared to the baseline semantic graph algorithm. For the benefit of the community, we open-source the complete implementation of our proposed algorithm and custom implementation of semantic registration at https://github.com/crepuscularlight/SemanticLoopClosure
中文: 本文提出了一种新颖的回环检测算法,通过图注意力网络编码语义图进行位置识别,并利用语义配准估计6自由度位姿约束,在SemanticKITTI数据集上实现了F1分数13%的提升。
English: This paper introduces a loop closure detection algorithm that uses graph attention networks to encode semantic graphs for place recognition and semantic registration to estimate 6 DoF pose constraints, achieving a 13% improvement in maximum F1 score on SemanticKITTI over the baseline semantic graph algorithm.

Authors:Natalie Maus, Kyurae Kim, Yimeng Zeng, Haydn Thomas Jones, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Jacob R. Gardner
Title: Covering Multiple Objectives with a Small Set of Solutions Using Bayesian Optimization
Abstract:
In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of $T$ black-box objective functions, $f_1$, ..., $f_T$, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In this work, we consider a problem setting that departs from this paradigm: finding a small set of K < T solutions, that collectively "covers" the T objectives. A set of solutions is defined as "covering" if, for each objective $f_1$, ..., $f_T$, there is at least one good solution. A motivating example for this problem setting occurs in drug design. For example, we may have T pathogens and aim to identify a set of K < T antibiotics such that at least one antibiotic can be used to treat each pathogen. To address this problem, we propose Multi-Objective Coverage Bayesian Optimization (MOCOBO), a principled algorithm designed to efficiently find a covering set. We validate our approach through experiments on challenging high-dimensional tasks, including applications in peptide and molecular design, where MOCOBO is shown to find high-performing covering sets of solutions. The results show that the coverage of the K < T solutions found by MOCOBO matches or nearly matches the coverage of T solutions obtained by optimizing each objective individually. Furthermore, in in vitro experiments, the peptides found by MOCOBO exhibited high potency against drug-resistant pathogens, further demonstrating the potential of MOCOBO for drug discovery. We make code available here: https://github.com/nataliemaus/mocobo.
中文: 本文提出了多目标覆盖贝叶斯优化(MOCOBO)算法,通过寻找K < T个解的集合来有效覆盖所有T个目标,在药物设计等应用中展现出优越性能。
English: This paper introduces Multi-Objective Coverage Bayesian Optimization (MOCOBO), a novel algorithm that efficiently identifies a small set of K < T solutions to collectively cover all T objectives in multi-objective black-box optimization, with demonstrated success in drug design applications.
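
The notion of K solutions "covering" T objectives can be made concrete with a small scoring function. This is a sketch under an assumption: coverage is measured here as the sum, over objectives, of the best value achieved by any solution in the set. The abstract does not spell out the exact score, so treat this definition and the numbers as illustrative.

```python
def coverage_score(scores):
    """scores[k][t] is the value of objective t at solution k (higher is better).

    Returns the sum over objectives of the best value any solution achieves.
    """
    num_objectives = len(scores[0])
    return sum(max(row[t] for row in scores) for t in range(num_objectives))


# Illustrative values: K = 2 solutions evaluated on T = 3 objectives.
scores = [
    [0.9, 0.8, 0.1],  # solution 0 handles objectives 0 and 1 well
    [0.2, 0.1, 0.7],  # solution 1 handles objective 2
]
total = coverage_score(scores)
single_best = max(coverage_score([row]) for row in scores)
```

Here the pair scores about 2.4, while the best single solution reaches only about 1.8, which is the kind of gap a covering set is meant to close.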

Authors:Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
Title: Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Abstract:
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains against decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios. The code is available at https://github.com/BaohaoLiao/RSD.
Chinese: RSD是一种新颖的推测解码框架,通过结合轻量级草稿模型与强大目标模型,并利用过程奖励模型动态优化计算成本与输出质量,显著提升大语言模型推理效率,最高减少4.4倍计算量且提高准确率。
English: RSD is a novel speculative decoding framework that enhances LLM inference efficiency by integrating a lightweight draft model with a powerful target model and using a process reward model to dynamically optimize computational cost and output quality, achieving up to 4.4x fewer FLOPs and improved accuracy.
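
The threshold-based decision described in this abstract can be sketched as a simple loop: accept the draft model's step when the process reward model scores it highly enough, otherwise fall back to the target model. The callables `draft_step`, `target_step`, and `reward` are hypothetical stand-ins, and the fixed step count is a simplification; this is not the RSD implementation.

```python
def reward_guided_decode(draft_step, target_step, reward, prompt, threshold, max_steps):
    """Build a reasoning trace step by step, invoking the target model only on low-reward drafts."""
    steps, target_calls = [], 0
    for _ in range(max_steps):
        candidate = draft_step(prompt, steps)
        if reward(prompt, steps, candidate) < threshold:
            # The draft step looks weak: pay for the stronger target model instead.
            candidate = target_step(prompt, steps)
            target_calls += 1
        steps.append(candidate)
    return steps, target_calls


# Toy stand-ins: the "reward model" likes every other draft step.
def toy_draft(prompt, steps):
    return f"draft-{len(steps)}"


def toy_target(prompt, steps):
    return f"target-{len(steps)}"


def toy_reward(prompt, steps, candidate):
    return 0.9 if len(steps) % 2 == 0 else 0.3


trace, calls = reward_guided_decode(toy_draft, toy_target, toy_reward, "q", threshold=0.5, max_steps=4)
```

Raising the threshold sends more steps to the target model (higher cost, typically higher quality), while lowering it keeps more of the cheap draft steps.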

Authors:Nafis Irtiza Tripto, Saranya Venkatraman, Mahjabin Nahar, Dongwon Lee
Title: Beyond checkmate: exploring the creative chokepoints in AI text
Abstract:
The rapid advancement of Large Language Models (LLMs) has revolutionized text generation but also raised concerns about potential misuse, making detecting LLM-generated text (AI text) increasingly essential. While prior work has focused on identifying AI text and effectively checkmating it, our study investigates a less-explored territory: portraying the nuanced distinctions between human and AI texts across text segments (introduction, body, and conclusion). Whether LLMs excel or falter in incorporating linguistic ingenuity across text segments, the results will critically inform their viability and boundaries as effective creative assistants to humans. Through an analogy with the structure of chess games, comprising opening, middle, and end games, we analyze segment-specific patterns to reveal where the most striking differences lie. Although AI texts closely resemble human writing in the body segment due to its length, deeper analysis shows a higher divergence in features dependent on the continuous flow of language, making it the most informative segment for detection. Additionally, human texts exhibit greater stylistic variation across segments, offering a new lens for distinguishing them from AI. Overall, our findings provide fresh insights into human-AI text differences and pave the way for more effective and interpretable detection strategies. Codes available at https://github.com/tripto03/chess_inspired_human_ai_text_distinction.
中文摘要:本研究揭示,尽管AI生成文本在主体段落与人类写作高度相似,但在需要语言连贯性的特征上差异显著,且人类文本在不同段落间风格变化更大,为检测提供了新视角。
English Summary: This study reveals that while AI-generated text closely mimics human writing in body segments, it shows greater divergence in features requiring continuous language flow, and human texts display more stylistic variation across different segments, offering new insights for detection.

Authors:Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Title: Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
Abstract:
The remarkable progress in text-to-video diffusion models enables photorealistic generations, although the contents of the generated video often include unnatural movement or deformation, reverse playback, and motionless scenes. Recently, an alignment problem has attracted huge attention, where we steer the output of diffusion models based on some quantity on the goodness of the content. Because there is a large room for improvement of perceptual quality along the frame direction, we should address which metrics we should optimize and how we can optimize them in the video generation. In this paper, we propose diffusion latent beam search with lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward, at inference time. We then point out that the improvement of perceptual video quality considering the alignment to prompts requires reward calibration by weighting existing metrics. This is because when humans or vision language models evaluate outputs, many previous metrics to quantify the naturalness of video do not always correlate with evaluation. We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling under much more efficient computational cost. The experiments highlight that our method is beneficial to many capable generative models, and provide a practical guideline that we should prioritize the inference-time compute allocation into lookahead steps for reward estimation over search budget or denoising steps.
中文摘要:本文针对文本到视频扩散模型生成视频常出现运动不自然和对齐不佳的问题,提出了一种扩散潜在波束搜索方法,通过校准奖励机制优化视频质量,无需更新模型参数即可显著提升生成效果。
English Summary: Text-to-video diffusion models can now produce realistic videos but often suffer from unnatural motion and poor alignment with prompts, which this paper addresses by introducing a diffusion latent beam search method that optimizes video quality through calibrated rewards without updating model parameters.
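
The search procedure named in this abstract can be illustrated with a generic beam search over latents. This is a toy sketch: latents are plain numbers, `propose` stands in for sampling candidate next latents during denoising, and `lookahead_reward` stands in for the reward estimated by looking ahead; all three are assumptions for illustration rather than the paper's components.

```python
def latent_beam_search(init, propose, lookahead_reward, num_steps, beam_width, num_candidates):
    """Keep the beam_width highest-reward latents at every step and return the best final one."""
    beams = [init]
    for step in range(num_steps):
        candidates = [propose(latent, step, j)
                      for latent in beams
                      for j in range(num_candidates)]
        # Rank candidates by the estimated reward of where they would end up.
        candidates.sort(key=lambda z: lookahead_reward(z, step), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy problem: move a number toward the target value 3.0.
def toy_propose(latent, step, j):
    return latent + (-1.0, 0.0, 1.0)[j]


def toy_reward(latent, step):
    return -abs(latent - 3.0)


best = latent_beam_search(0.0, toy_propose, toy_reward, num_steps=3, beam_width=2, num_candidates=3)
```

Greedy search corresponds to `beam_width=1`, and best-of-N sampling corresponds to scoring only complete generations, so both baselines mentioned in the abstract fit the same skeleton.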

Authors:Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Title: Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search
Abstract:
The remarkable progress in text-to-video diffusion models enables the generation of photorealistic videos, although the content of these generated videos often includes unnatural movement or deformation, reverse playback, and motionless scenes. Recently, an alignment problem has attracted huge attention, where we steer the output of diffusion models based on some measure of the content's goodness. Because there is a large room for improvement of perceptual quality along the frame direction, we should address which metrics we should optimize and how we can optimize them in the video generation. In this paper, we propose diffusion latent beam search with lookahead estimator, which can select a better diffusion latent to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality with respect to alignment to prompts requires reward calibration by weighting existing metrics. This is because when humans or vision language models evaluate outputs, many previous metrics to quantify the naturalness of video do not always correlate with the evaluation. We demonstrate that our method improves the perceptual quality evaluated on the calibrated reward, VLMs, and human assessment, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling under much more efficient computational cost. The experiments highlight that our method is beneficial to many capable generative models, and provide a practical guideline: we should prioritize the inference-time compute allocation into enabling the lookahead estimator and increasing the search budget, rather than expanding the denoising steps.
中文摘要:本文针对文本到视频扩散模型生成视频常出现运动不自然和对齐不佳的问题,提出了一种扩散潜在波束搜索方法,通过校准奖励机制优化视频质量,无需更新模型参数即可显著提升生成效果。
English Summary: Text-to-video diffusion models can now produce realistic videos but often suffer from unnatural motion and poor alignment with prompts, which this paper addresses by introducing a diffusion latent beam search method that optimizes video quality through calibrated rewards without updating model parameters.

Authors:Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, Xingyu Zhu, Yanbin Hao
Title: Accelerating Diffusion Transformer via Error-Optimized Cache
Abstract:
Diffusion Transformer (DiT) is a crucial method for content generation. However, it needs a lot of time to sample. Many studies have attempted to use caching to reduce the time consumption of sampling. Existing caching methods accelerate generation by reusing DiT features from the previous time step and skipping calculations in the next, but they tend to locate and cache low-error modules without focusing on reducing caching-induced errors, resulting in a sharp decline in generated content quality when increasing caching intensity. To solve this problem, we propose the Error-Optimized Cache (EOC). This method introduces three key improvements: (1) Prior knowledge extraction: Extract and process the caching differences; (2) A judgment method for cache optimization: Determine whether certain caching steps need to be optimized; (3) Cache optimization: reduce caching errors. Experiments show that this algorithm significantly reduces the error accumulation caused by caching, especially excessive caching. On the ImageNet dataset, without substantially increasing the computational load, this method improves the FID of the generated images when the rule-based model FORA has a caching level of 75%, 50%, and 25%, and the training-based model Learning-to-cache has a caching level of 22%. Specifically, the FID values change from 30.454 to 21.690 (28.8%), from 6.857 to 5.821 (15.1%), from 3.870 to 3.692 (4.6%), and from 3.539 to 3.451 (2.5%) respectively. Code is available at https://github.com/qiujx0520/EOC_MM2025.git.
中文: 提出的误差优化缓存(EOC)方法通过引入先验知识提取、缓存优化判断和误差减少技术,有效降低了扩散变换器采样中的误差累积,在高缓存强度下显著提升了生成图像的FID等质量指标。
English: The proposed Error-Optimized Cache (EOC) method effectively reduces error accumulation in Diffusion Transformers during sampling by introducing prior knowledge extraction, cache optimization judgment, and error reduction techniques, significantly improving image quality metrics like FID under high caching intensities.
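
The percentages quoted in this abstract are relative FID reductions, which can be checked with a few lines of arithmetic. The helper name below is illustrative; the FID values are the ones reported in the abstract.

```python
def relative_reduction(before, after):
    """Percentage by which a lower-is-better metric such as FID decreased."""
    return 100.0 * (before - after) / before


# (FID before, FID after) pairs as reported in the abstract.
pairs = [(30.454, 21.690), (6.857, 5.821), (3.870, 3.692), (3.539, 3.451)]
reductions = [round(relative_reduction(b, a), 1) for b, a in pairs]
```

The rounded results reproduce the 28.8%, 15.1%, 4.6%, and 2.5% figures, confirming that they are relative rather than absolute changes.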

Authors:Arsenii Gavrikov, Julián García Pardiñas, Alberto Garfagnini
Title: DINAMO: Dynamic and INterpretable Anomaly MOnitoring for Large-Scale Particle Physics Experiments
Abstract:
Ensuring reliable data collection in large-scale particle physics experiments demands Data Quality Monitoring (DQM) procedures to detect possible detector malfunctions and preserve data integrity. Traditionally, this resource-intensive task has been handled by human shifters who struggle with frequent changes in operational conditions. We present DINAMO: a novel, interpretable, robust, and scalable DQM framework designed to automate anomaly detection in time-dependent settings. Our approach constructs evolving histogram templates with built-in uncertainties, featuring both a statistical variant - extending the classical Exponentially Weighted Moving Average (EWMA) - and a machine learning (ML)-enhanced version that leverages a transformer encoder for improved adaptability. Experimental validations on synthetic datasets demonstrate the high accuracy, adaptability, and interpretability of these methods. The statistical variant is being commissioned in the LHCb experiment at the Large Hadron Collider, underscoring its real-world impact. The code used in this study is available at https://github.com/ArseniiGav/DINAMO.
中文摘要:DINAMO提出了一种可解释且可扩展的框架,用于自动化粒子物理实验中的异常检测,结合了统计和机器学习方法,已在LHCb实验中验证其高精度并投入实际应用。
English Summary: DINAMO introduces an interpretable and scalable framework for automating anomaly detection in particle physics experiments, offering a statistical variant that extends EWMA and a transformer-based variant; both show high accuracy on synthetic datasets, and the statistical variant is being commissioned in the LHCb experiment.
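
The classical Exponentially Weighted Moving Average that the statistical variant extends can be sketched for histogram templates as below. This covers only the textbook EWMA update applied bin by bin; the built-in uncertainties and the anomaly decision described in the abstract are not modeled, and the smoothing convention (`alpha` weighting the old template) is an assumption.

```python
def ewma_update(template, histogram, alpha):
    """Blend a new histogram into the running template, bin by bin.

    alpha close to 1 makes the template change slowly; alpha close to 0 tracks new data quickly.
    """
    return [alpha * t + (1.0 - alpha) * h for t, h in zip(template, histogram)]


# Illustrative two-bin histograms.
template = [0.0, 0.0]
observed = [1.0, 2.0]
after_one = ewma_update(template, observed, alpha=0.5)
after_two = ewma_update(after_one, observed, alpha=0.5)
```

Because the template keeps moving toward recent good data, it can follow slow changes in operating conditions, which is the property that makes EWMA a natural starting point for time-dependent monitoring.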

Authors:Valtteri Ala-Salmi, Zeeshan Rasheed, Abdul Malik Sami, Zheying Zhang, Kai-Kristian Kemell, Jussi Rasku, Shahbaz Siddeeq, Mika Saari, Pekka Abrahamsson
Title: Autonomous Legacy Web Application Upgrades Using a Multi-Agent System
Abstract:
The use of Large Language Models (LLMs) for autonomous code generation is gaining attention in emerging technologies. As LLM capabilities expand, they offer new possibilities such as code refactoring, security enhancements, and legacy application upgrades. Many outdated web applications pose security and reliability challenges, yet companies continue using them due to the complexity and cost of upgrades. To address this, we propose an LLM-based multi-agent system that autonomously upgrades legacy web applications to the latest versions. The system distributes tasks across multiple phases, updating all relevant files. To evaluate its effectiveness, we employed Zero-Shot Learning (ZSL) and One-Shot Learning (OSL) prompts, applying identical instructions in both cases. The evaluation involved updating view files and measuring the number and types of errors in the output. For complex tasks, we counted the successfully met requirements. The experiments compared the proposed system with standalone LLM execution, repeated multiple times to account for stochastic behavior. Results indicate that our system maintains context across tasks and agents, improving solution quality over the base model in some cases. This study provides a foundation for future model implementations in legacy code updates. Additionally, findings highlight LLMs' ability to update small outdated files with high precision, even with basic prompts. The source code is publicly available on GitHub: https://github.com/alasalm1/Multi-agent-pipeline.
中文摘要:本研究提出了一种基于大语言模型的多智能体系统,能够自主升级遗留网络应用程序,通过分布式任务管理和跨阶段上下文保持,在部分案例中相比单一模型提升了解决方案质量。
English Summary: This study introduces a multi-agent system that uses Large Language Models to autonomously upgrade legacy web applications; by distributing tasks across phases and maintaining context across agents, it improves solution quality over standalone LLM execution in some cases.

Authors:Zixi Wang, Yushe Cao, Yubo Huang, Jinzhu Wei, Jingzehua Xu, Shuai Zhang, Xin Lai
Title: Self-Training with Dynamic Weighting for Robust Gradual Domain Adaptation
Abstract:
In this paper, we propose a new method called Self-Training with Dynamic Weighting (STDW), which aims to enhance robustness in Gradual Domain Adaptation (GDA) by addressing the challenge of smooth knowledge migration from the source to the target domain. Traditional GDA methods mitigate domain shift through intermediate domains and self-training but often suffer from inefficient knowledge migration or incomplete intermediate data. Our approach introduces a dynamic weighting mechanism that adaptively balances the loss contributions of the source and target domains during training. Specifically, we design an optimization framework governed by a time-varying hyperparameter $\varrho$ (progressing from 0 to 1), which controls the strength of domain-specific learning and ensures stable adaptation. The method leverages self-training to generate pseudo-labels and optimizes a weighted objective function for iterative model updates, maintaining robustness across intermediate domains. Experiments on rotated MNIST, color-shifted MNIST, portrait datasets, and the Cover Type dataset demonstrate that STDW outperforms existing baselines. Ablation studies further validate the critical role of $\varrho$'s dynamic scheduling in achieving progressive adaptation, confirming its effectiveness in reducing domain bias and improving generalization. This work provides both theoretical insights and a practical framework for robust gradual domain adaptation, with potential applications in dynamic real-world scenarios. The code is available at https://github.com/Dramwig/STDW.
中文: 本文提出动态加权自训练方法(STDW),通过时变参数动态平衡源域和目标域的损失权重,在渐进域适应中提升鲁棒性,并在多个数据集上验证了其优越性能。
English: This paper introduces Self-Training with Dynamic Weighting (STDW), a method that enhances robustness in Gradual Domain Adaptation by dynamically balancing source and target domain losses through a time-varying parameter, demonstrating superior performance across multiple datasets.
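
The dynamic weighting described in this abstract can be illustrated with a minimal sketch. The linear schedule and the convex-combination form of the objective below are assumptions made for illustration; the abstract only states that $\varrho$ progresses from 0 to 1 and balances the source and target loss contributions, so the authors' actual objective may differ. The function names `rho_schedule` and `weighted_objective` are hypothetical.

```python
def rho_schedule(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: rho is 0 at the first step and 1 at the last."""
    if total_steps <= 1:
        return 1.0
    return min(1.0, max(0.0, step / (total_steps - 1)))


def weighted_objective(loss_source: float, loss_target: float, rho: float) -> float:
    """Convex combination of the two domain losses (assumed form, not from the paper)."""
    return (1.0 - rho) * loss_source + rho * loss_target
```

With this form, `rho = 0` trains only on the source loss and `rho = 1` only on the pseudo-labelled target loss; intermediate values interpolate between the two.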

Authors:Yunfan Lu, Yanlin Qian, Ziyang Rao, Junren Xiao, Liming Chen, Hui Xiong
Title: RGB-Event ISP: The Dataset and Benchmark
Abstract:
Event-guided imaging has received significant attention due to its potential to revolutionize instant imaging systems. However, prior methods primarily focus on enhancing RGB images in a post-processing manner, neglecting both the challenges an image signal processor (ISP) faces when dealing with event sensors and the benefits events provide for reforming the ISP process. To address this gap, we conduct the first research on event-guided ISP. First, we present a new event-RAW paired dataset, collected with a novel but still confidential sensor that records pixel-level aligned events and RAW images. This dataset includes 3373 RAW images with 2248 x 3264 resolution and their corresponding events, spanning 24 scenes with 3 exposure modes and 3 lenses. Second, we propose a conventional ISP pipeline to generate good RGB frames as reference. This conventional ISP pipeline performs basic ISP operations, e.g., demosaicing, white balancing, denoising, and color space transforming, with a ColorChecker as reference. Third, we classify the existing learnable ISP methods into 3 classes, and select multiple methods to train and evaluate on our new dataset. Lastly, since there is no prior work for reference, we propose a simple event-guided ISP method and test it on our dataset. We further put forward key technical challenges and future directions in RGB-Event ISP. In summary, to the best of our knowledge, this is the very first research focusing on event-guided ISP, and we hope it will inspire the community. The code and dataset are available at: https://github.com/yunfanLu/RGB-Event-ISP.
中文摘要:本研究开创性地探索事件引导的图像信号处理(ISP),通过构建新型对齐的事件-RAW数据集并提出基准方法,首次解决如何利用事件数据优化ISP流程而非仅用于后处理的问题。
English Summary: This research pioneers event-guided image signal processing (ISP) by introducing a novel aligned event-RAW dataset and proposing a baseline method, addressing the gap in leveraging events for ISP enhancement rather than just post-processing.

Authors:Hong Huang, Hai Yang, Yuan Chen, Jiaxun Ye, Dapeng Wu
Title: FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling
Abstract:
Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as greedy adjustments, unstable topologies, and communication inefficiency, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose Federated Robust pruning via combinatorial Thompson Sampling (FedRTS), a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable, farsighted information instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: https://github.com/Little0o0/FedRTS
中文摘要:FedRTS提出了一种基于组合汤普森采样的新型联邦学习框架,通过概率决策机制开发鲁棒稀疏模型,在数据异构场景下实现了更优性能并显著降低了通信成本。
English Summary: FedRTS introduces a novel federated learning framework using combinatorial Thompson Sampling to develop robust sparse models, achieving superior performance and reduced communication costs in heterogeneous data environments.
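
As a rough illustration of the probabilistic selection idea behind Thompson sampling, the sketch below keeps one Beta posterior per candidate, draws a sample from each, and retains the top k draws. This is a generic combinatorial Thompson sampling toy, not the authors' TSAdj mechanism: the Beta/Bernoulli feedback model and the names `thompson_top_k` and `update_posterior` are assumptions for illustration only.

```python
import random


def thompson_top_k(alphas, betas, k, rng):
    """Draw one Beta sample per candidate and keep the indices of the k highest draws."""
    draws = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    ranked = sorted(range(len(draws)), key=lambda i: draws[i], reverse=True)
    return sorted(ranked[:k])


def update_posterior(alphas, betas, index, success):
    """Bernoulli-style posterior update for one candidate (assumed feedback model)."""
    if success:
        alphas[index] += 1.0
    else:
        betas[index] += 1.0
```

Because the selection is sampled rather than taken greedily from point estimates, candidates with uncertain posteriors still get chosen occasionally, which is the usual motivation for preferring this scheme over deterministic top-k.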

Authors:Zhengqin Lai, Xiaopeng Hong, Yabin Wang, Xiaobai Li
Title: A Benchmark for Incremental Micro-expression Recognition
Abstract:
Micro-expression recognition plays a pivotal role in understanding hidden emotions and has applications across various fields. Traditional recognition methods assume access to all training data at once, but real-world scenarios involve continuously evolving data streams. To respond to the requirement of adapting to new data while retaining previously learned knowledge, we introduce the first benchmark specifically designed for incremental micro-expression recognition. Our contributions include: Firstly, we formulate the incremental learning setting tailored for micro-expression recognition. Secondly, we organize sequential datasets with carefully curated learning orders to reflect real-world scenarios. Thirdly, we define two cross-evaluation-based testing protocols, each targeting distinct evaluation objectives. Finally, we provide six baseline methods and their corresponding evaluation results. This benchmark lays the groundwork for advancing incremental micro-expression recognition research. All source code used in this study will be publicly available at https://github.com/ZhengQinLai/IMER-benchmark.
Chinese: 本研究首次提出了增量式微表情识别的基准,通过制定学习协议、整理有序数据集和提供基线方法,解决了适应动态数据流的挑战,为该领域研究奠定了基础。
English: This study introduces the first benchmark for incremental micro-expression recognition, addressing the challenge of adapting to evolving data streams by establishing learning protocols, curated datasets, and baseline methods to advance research in this field.

Authors:Jialin Zhao, Yingtao Zhang, Carlo Vittorio Cannistraci
Title: Pivoting Factorization: A Compact Meta Low-Rank Representation of Sparsity for Efficient Inference in Large Language Models
Abstract:
The rapid growth of Large Language Models has driven demand for effective model compression techniques to reduce memory and computation costs. Low-rank pruning has gained attention for its GPU compatibility across all densities. However, low-rank pruning struggles to match the performance of semi-structured pruning, often doubling perplexity at similar densities. In this paper, we propose Pivoting Factorization (PIFA), a novel lossless meta low-rank representation that unsupervisedly learns a compact form of any low-rank representation, effectively eliminating redundant information. PIFA identifies pivot rows (linearly independent rows) and expresses non-pivot rows as linear combinations, achieving 24.2% additional memory savings and 24.6% faster inference over low-rank layers at rank = 50% of dimension. To mitigate the performance degradation caused by low-rank pruning, we introduce a novel, retraining-free reconstruction method that minimizes error accumulation (M). MPIFA, combining M and PIFA into an end-to-end framework, significantly outperforms existing low-rank pruning methods, and achieves performance comparable to semi-structured pruning, while surpassing it in GPU efficiency and compatibility. Our code is available at https://github.com/biomedical-cybernetics/pivoting-factorization.
中文摘要:本文提出了Pivoting Factorization (PIFA),一种无损元低秩表示方法,通过消除冗余信息提升模型压缩效果,并结合误差最小化重构的MPIFA框架,在保持与半结构化剪枝相当性能的同时显著提高了GPU效率。
English Summary: The paper introduces Pivoting Factorization (PIFA), a lossless meta low-rank representation that enhances model compression by eliminating redundancy, and MPIFA, an end-to-end framework combining PIFA with error-minimizing reconstruction to achieve performance comparable to semi-structured pruning while improving GPU efficiency.
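
The pivot-row idea in this abstract can be sketched on a toy matrix. The code below uses exact rational arithmetic and row reduction of the transpose to find linearly independent (pivot) rows and the coefficients that express each remaining row as a combination of them. It is an illustrative assumption about how such a factorization can be computed, not the authors' GPU implementation; `rref`, `pivot_factorize`, and `reconstruct` are hypothetical names, and the input is assumed to have at least one nonzero row.

```python
from fractions import Fraction


def rref(matrix):
    """Reduced row echelon form with exact arithmetic; returns (rows, pivot_columns)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    pivots = []
    r = 0
    for c in range(len(rows[0])):
        if r == len(rows):
            break
        pr = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pr is None:
            continue  # no pivot in this column
        rows[r], rows[pr] = rows[pr], rows[r]
        pv = rows[r][c]
        rows[r] = [x / pv for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    return rows, pivots


def pivot_factorize(w):
    """Split w into pivot rows plus coefficients for every non-pivot row."""
    transposed = [list(col) for col in zip(*w)]
    # Pivot columns of the transpose are linearly independent rows of w.
    reduced, pivots = rref(transposed)
    pivot_rows = [[Fraction(x) for x in w[p]] for p in pivots]
    coeffs = {
        j: [reduced[k][j] for k in range(len(pivots))]
        for j in range(len(w)) if j not in pivots
    }
    return pivots, pivot_rows, coeffs


def reconstruct(n_rows, pivots, pivot_rows, coeffs):
    """Rebuild the full matrix from the compact representation."""
    out = [None] * n_rows
    for p, row in zip(pivots, pivot_rows):
        out[p] = row
    for j, c in coeffs.items():
        out[j] = [sum(ck * row[col] for ck, row in zip(c, pivot_rows))
                  for col in range(len(pivot_rows[0]))]
    return out
```

For `w = [[1, 2], [2, 4], [0, 1]]`, rows 0 and 2 are pivots and row 1 is stored only as the coefficient pair `[2, 0]`, so reconstruction is exact, which matches the abstract's description of the representation as lossless.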

Authors:Xingyu Miao, Haoran Duan, Yang Bai, Tejal Shah, Jun Song, Yang Long, Rajiv Ranjan, Ling Shao
Title: Laser: Efficient Language-Guided Segmentation in Neural Radiance Fields
Abstract:
In this work, we propose a method that leverages CLIP feature distillation, achieving efficient 3D segmentation through language guidance. Unlike previous methods that rely on multi-scale CLIP features and are limited by processing speed and storage requirements, our approach aims to streamline the workflow by directly and effectively distilling dense CLIP features, thereby achieving precise segmentation of 3D scenes using text. To achieve this, we introduce an adapter module and mitigate the noise issue in the dense CLIP feature distillation process through a self-cross-training strategy. Moreover, to enhance the accuracy of segmentation edges, this work presents a low-rank transient query attention mechanism. To ensure the consistency of segmentation for similar colors under different viewpoints, we convert the segmentation task into a classification task through label volume, which significantly improves the consistency of segmentation in color-similar areas. We also propose a simplified text augmentation strategy to alleviate the issue of ambiguity in the correspondence between CLIP features and text. Extensive experimental results show that our method surpasses current state-of-the-art technologies in both training speed and performance. Our code is available on: https://github.com/xingy038/Laser.git.
Chinese: 本研究提出一种通过语言引导直接蒸馏密集CLIP特征的方法,利用自交叉训练和低秩注意力机制等策略,在训练速度和性能上均超越了现有先进技术,实现了高效的3D场景精确分割。
English: This work introduces a method that efficiently achieves precise 3D segmentation by directly distilling dense CLIP features with language guidance, incorporating strategies like self-cross-training and a low-rank attention mechanism to enhance performance and speed beyond current state-of-the-art technologies.

Authors:Dahye Kim, Deepti Ghadiyaram
Title: Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations
Abstract:
Despite the remarkable progress in text-to-image generative models, they are prone to adversarial attacks and inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer the generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) -- all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise the generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5}$x faster than the current state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers
中文: 本文提出了一种利用k稀疏自编码器的新框架,在推理过程中高效操控扩散模型中的概念,无需重新训练即可精确控制生成内容,同时提升安全性和速度。
English: This paper introduces a novel framework using k-sparse autoencoders to efficiently manipulate concepts in diffusion models during inference, enabling precise control over content generation without retraining while improving safety and speed.
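
The defining step of a k-sparse autoencoder is a top-k activation: only the k largest latent activations are kept and the rest are zeroed. The sketch below shows that step alone, on a plain Python list; the encoder, decoder, and the steering procedure from the paper are not reproduced, and the name `top_k_sparsify` is hypothetical.

```python
def top_k_sparsify(activations, k):
    """Keep the k largest activations and set every other entry to zero."""
    if k <= 0:
        return [0.0 for _ in activations]
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]
```

For example, `top_k_sparsify([0.1, 2.0, -1.0, 0.5], 2)` keeps only the entries 2.0 and 0.5, so at most k latent units are active for any input.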

Authors:Basant Sharma, Arun Kumar Singh
Title: Trajectory Optimization Under Stochastic Dynamics Leveraging Maximum Mean Discrepancy
Abstract:
This paper addresses sampling-based trajectory optimization for risk-aware navigation under stochastic dynamics. Typically such approaches operate by computing $\tilde{N}$ perturbed rollouts around the nominal dynamics to estimate the collision risk associated with a sequence of control commands. We consider a setting where it is expensive to estimate risk using perturbed rollouts, for example, due to expensive collision-checks. We put forward two key contributions. First, we develop an algorithm that distills the statistical information from a larger set of rollouts to a reduced-set with sample size $N \ll \tilde{N}$. Consequently, we estimate collision risk using just $N$ rollouts instead of $\tilde{N}$. Second, we formulate a novel surrogate for the collision risk that can leverage the distilled statistical information contained in the reduced-set. We formalize both algorithmic contributions using distribution embedding in Reproducing Kernel Hilbert Space (RKHS) and Maximum Mean Discrepancy (MMD). We perform extensive benchmarking to demonstrate that our MMD-based approach leads to safer trajectories in the low-sample regime than existing baselines using Conditional Value-at-Risk (CVaR) based collision risk estimates.
中文: 本文提出了一种基于采样的轨迹优化方法,通过使用RKHS和MMD将大量轨迹的风险信息提炼到小样本集合中,从而在减少计算成本的同时生成比CVaR方法更安全的轨迹。
English: This paper introduces a sampling-based trajectory optimization method that reduces computational cost by distilling risk information from a large set of rollouts to a smaller set using RKHS and MMD, resulting in safer trajectories with fewer samples compared to CVaR-based approaches.
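
The Maximum Mean Discrepancy named in this abstract compares two sample sets through their kernel mean embeddings. As a minimal sketch, the code below computes the standard biased estimate of squared MMD with a Gaussian kernel. The kernel choice, the bandwidth value, and the names `gaussian_kernel` and `mmd_squared` are assumptions for illustration; the paper's reduced-set selection and collision-risk surrogate are not reproduced here.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(a_set, b_set):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in a_set for b in b_set)
        return total / (len(a_set) * len(b_set))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when the two sets are identical and grows as they move apart, so a small set of N rollouts whose MMD to the full set of $\tilde{N}$ rollouts is low can be read as a statistically faithful summary of it.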

Authors:Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin
Title: RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs
Abstract:
Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs, offering valuable insights for future MLLM architecture design. Furthermore, by leveraging our reduction framework as a training-free inference acceleration approach, we achieve performance comparable to or better than state-of-the-art methods while remaining compatible with them. Code will be publicly available at https://github.com/L-Hugh/RedundancyLens.
中文: 当前多模态大语言模型存在性能与效率的权衡,本研究提出无需训练的分析框架,发现仅解码器架构中存在显著冗余,并通过推理加速实现了可比甚至更优的性能。
English: Current MLLMs face a performance-efficiency tradeoff, with this study proposing a training-free framework that identifies substantial redundancy in decoder-only architectures and achieves comparable or better performance through inference acceleration.

Authors:Javier Montalvo, Pablo Carballeira, Álvaro García-Martín
Title: SynthmanticLiDAR: A Synthetic Dataset for Semantic Segmentation on LiDAR Imaging
Abstract:
Semantic segmentation on LiDAR imaging is increasingly gaining attention, as it can provide useful knowledge for perception systems and potential for autonomous driving. However, collecting and labeling real LiDAR data is an expensive and time-consuming task. While datasets such as SemanticKITTI have been manually collected and labeled, the introduction of simulation tools such as CARLA has enabled the creation of synthetic datasets on demand. In this work, we present a modified CARLA simulator designed with LiDAR semantic segmentation in mind, with new classes, more consistent object labeling with their counterparts from real datasets such as SemanticKITTI, and the possibility to adjust the object class distribution. Using this tool, we have generated SynthmanticLiDAR, a synthetic dataset for semantic segmentation on LiDAR imaging, designed to be similar to SemanticKITTI, and we evaluate its contribution to the training process of different semantic segmentation algorithms by using a naive transfer learning approach. Our results show that incorporating SynthmanticLiDAR into the training process improves the overall performance of tested algorithms, proving the usefulness of our dataset, and therefore, our adapted CARLA simulator. The dataset and simulator are available at https://github.com/vpulab/SynthmanticLiDAR.
Chinese: 本研究推出了改进的CARLA模拟器,生成模拟SemanticKITTI的合成数据集SynthmanticLiDAR,通过迁移学习验证了该数据集能有效提升激光雷达语义分割算法的性能。
English: This study introduces a modified CARLA simulator that generates SynthmanticLiDAR, a synthetic dataset designed to mimic SemanticKITTI for LiDAR semantic segmentation, demonstrating improved algorithm performance through transfer learning.

Authors:Bo Lan, Pei Li, Jiaxi Yin, Yunpeng Song, Ge Wang, Han Ding, Jinsong Han, Fei Wang
Title: XRF V2: A Dataset for Action Summarization with Wi-Fi Signals, and IMUs in Phones, Watches, Earbuds, and Glasses
Abstract:
Human Action Recognition (HAR) plays a crucial role in applications such as health monitoring, smart home automation, and human-computer interaction. While HAR has been extensively studied, action summarization using Wi-Fi and IMU signals in smart-home environments, which involves identifying and summarizing continuous actions, remains an emerging task. This paper introduces the novel XRF V2 dataset, designed for indoor daily activity Temporal Action Localization (TAL) and action summarization. XRF V2 integrates multimodal data from Wi-Fi signals, IMU sensors (smartphones, smartwatches, headphones, and smart glasses), and synchronized video recordings, offering a diverse collection of indoor activities from 16 volunteers across three distinct environments. To tackle TAL and action summarization, we propose the XRFMamba neural network, which excels at capturing long-term dependencies in untrimmed sensory sequences and achieves the best performance with an average mAP of 78.74, outperforming the recent WiFiTAD by 5.49 points in mAP@avg while using 35% fewer parameters. In action summarization, we introduce a new metric, Response Meaning Consistency (RMC), to evaluate action summarization performance. Our method achieves an average Response Meaning Consistency (mRMC) of 0.802. We envision XRF V2 as a valuable resource for advancing research in human action localization, action forecasting, pose estimation, multimodal foundation models pre-training, synthetic data generation, and more. The data and code are available at https://github.com/aiotgroup/XRFV2.
中文: 本文提出XRF V2数据集和XRFMamba神经网络,通过融合Wi-Fi和IMU多模态数据实现室内行为定位与摘要,在取得最优性能的同时引入了新的评估指标。
English: This paper introduces the XRF V2 dataset and XRFMamba neural network for indoor action localization and summarization, achieving state-of-the-art performance with multimodal Wi-Fi and IMU data while proposing a new evaluation metric.

Authors:Bangchao Wang, Yang Deng, Ruiqi Luo, Peng Liang, Tingting Bi
Title: MPLinker: Multi-template Prompt-tuning with Adversarial Training for Issue-commit Link Recovery
Abstract:
In recent years, the pre-training, prompting and prediction paradigm, known as prompt-tuning, has achieved significant success in Natural Language Processing (NLP). Issue-commit Link Recovery (ILR) in Software Traceability (ST) plays an important role in improving the reliability, quality, and security of software systems. The current ILR methods convert the ILR into a classification task using pre-trained language models (PLMs) and dedicated neural networks. However, these methods do not fully utilize the semantic information embedded in PLMs and therefore fail to achieve acceptable performance. To address this limitation, we introduce a novel paradigm: Multi-template Prompt-tuning with adversarial training for issue-commit Link recovery (MPLinker). MPLinker redefines the ILR task as a cloze task via template-based prompt-tuning and incorporates adversarial training to enhance model generalization and reduce overfitting. We evaluated MPLinker on six open-source projects using a comprehensive set of performance metrics. The experiment results demonstrate that MPLinker achieves an average F1-score of 96.10%, Precision of 96.49%, Recall of 95.92%, MCC of 94.04%, AUC of 96.05%, and ACC of 98.15%, significantly outperforming existing state-of-the-art methods. Overall, MPLinker improves the performance and generalization of ILR models, and introduces innovative concepts and methods for ILR. The replication package for MPLinker is available at https://github.com/WTU-intelligent-software-development/MPLinker
中文: MPLinker通过基于模板的提示调优和对抗训练将问题-提交链接恢复重构为填空任务,在六个开源项目中以优越的性能指标显著超越现有方法。
English: MPLinker introduces a novel multi-template prompt-tuning approach with adversarial training to reformulate Issue-commit Link Recovery as a cloze task, significantly outperforming existing methods with superior performance metrics across six open-source projects.

Authors:Shiyu Fang, Donghao Zhou, Yiming Cui, ChengKai Xu, Peng Hang, Jian Sun
Title: Recognize then Resolve: A Hybrid Framework for Understanding Interaction and Cooperative Conflict Resolution in Mixed Traffic
Abstract:
A lack of understanding of interactions and the inability to effectively resolve conflicts continue to impede the progress of Connected Autonomous Vehicles (CAVs) in their interactions with Human-Driven Vehicles (HDVs). To address this challenge, we propose the Recognize then Resolve (RtR) framework. First, a Bilateral Intention Progression Graph (BIPG) is constructed based on CAV-HDV interaction data to model the evolution of interactions and identify potential HDV intentions. Three typical interaction breakdown scenarios are then categorized, and key moments are defined for triggering cooperative conflict resolution. On this basis, a constrained Monte Carlo Tree Search (MCTS) algorithm is introduced to determine the optimal passage order while accommodating HDV intentions. Experimental results demonstrate that the proposed RtR framework outperforms other cooperative approaches in terms of safety and efficiency across various penetration rates, achieving results close to consistent cooperation while significantly reducing computational resources. Our code and data are available at: https://github.com/FanGShiYuu/RtR-Recognize-then-Resolve/.
中文摘要:提出的"先识别后解决"框架通过构建交互意图图谱和约束蒙特卡洛树搜索算法,有效提升网联自动驾驶车辆与人工驾驶车辆的交互安全性和效率,同时大幅降低计算资源消耗。
English Summary: The Recognize then Resolve (RtR) framework addresses CAV-HDV interaction challenges by modeling intention progression and using constrained MCTS for conflict resolution, achieving superior safety and efficiency with reduced computational costs.

Authors:Yunpeng Qu, Kun Yuan, Jinhua Hao, Kai Zhao, Qizhi Xie, Ming Sun, Chao Zhou
Title: Visual Autoregressive Modeling for Image Super-Resolution
Abstract:
Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, limit their application. Building upon the tremendous success of autoregressive models in the language domain, we propose VARSR, a novel visual autoregressive modeling framework for ISR in the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures and the diffusion refiner is utilized for modeling quantization residual loss to achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes will be released at https://github.com/qyp2000/VARSR.
中文摘要:VARSR框架提出了一种基于视觉自回归模型的图像超分辨率方法,通过下一尺度预测、前缀标记和尺度对齐位置编码等创新技术,在保证高保真度和真实感的同时,比基于扩散的方法更高效地生成优质图像。
English Summary: The VARSR framework introduces a visual autoregressive model for image super-resolution that uses next-scale prediction and innovative techniques like prefix tokens and scale-aligned positional encodings to efficiently generate high-fidelity, realistic images while outperforming diffusion-based methods in computational efficiency.

Authors:Zhengrui Guo, Qichen Sun, Jiabo Ma, Lishuang Feng, Jinzhuo Wang, Hao Chen
Title: Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images
Abstract:
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images. While transformer architectures excel at modeling long-range correlations through self-attention, their quadratic computational complexity makes them impractical for computational pathology applications. Existing solutions like local-global or linear self-attention reduce computational costs but compromise the strong modeling capabilities of full self-attention. In this work, we propose Querent, i.e., the query-aware long contextual dynamic modeling framework, which achieves a theoretically bounded approximation of full self-attention while delivering practical efficiency. Our method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. Through comprehensive experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across over 10 WSI datasets, our method demonstrates superior performance compared to the state-of-the-art approaches. Codes are available at https://github.com/dddavid4real/Querent.
中文: 提出的Querent框架通过自适应聚焦全切片图像中的相关区域,有效逼近完整自注意力机制,在保持实际计算效率的同时,在多种计算病理学任务中展现出卓越性能。
English: The proposed Querent framework efficiently approximates full self-attention by adaptively focusing on relevant regions in whole slide images, achieving superior performance across multiple computational pathology tasks while maintaining practical computational efficiency.

Authors:Seungheun Baek, Soyon Park, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Title: GPO-VAE: Modeling Explainable Gene Perturbation Responses utilizing GRN-Aligned Parameter Optimization
Abstract:
Motivation: Predicting cellular responses to genetic perturbations is essential for understanding biological systems and developing targeted therapeutic strategies. While variational autoencoders (VAEs) have shown promise in modeling perturbation responses, their limited explainability poses a significant challenge, as the learned features often lack clear biological meaning. Nevertheless, model explainability is one of the most important aspects in the realm of biological AI. One of the most effective ways to achieve explainability is incorporating the concept of gene regulatory networks (GRNs) in designing deep learning models such as VAEs. GRNs elicit the underlying causal relationships between genes and are capable of explaining the transcriptional responses caused by genetic perturbation treatments. Results: We propose GPO-VAE, an explainable VAE enhanced by GRN-aligned Parameter Optimization that explicitly models gene regulatory networks in the latent space. Our key approach is to optimize the learnable parameters related to latent perturbation effects towards GRN-aligned explainability. Experimental results on perturbation prediction show our model achieves state-of-the-art performance in predicting transcriptional responses across multiple benchmark datasets. Furthermore, additional results on evaluating the GRN inference task reveal our model's ability to generate meaningful GRNs compared to other methods. According to qualitative analysis, GPO-VAE possesses the ability to construct biologically explainable GRNs that align with experimentally validated regulatory pathways. GPO-VAE is available at https://github.com/dmis-lab/GPO-VAE
Chinese: GPO-VAE模型通过将基因调控网络整合到变分自编码器中,显著提升了模型的可解释性,在遗传扰动预测任务中表现优异,并能生成与实验验证通路一致的生物学调控网络。
English: The GPO-VAE model integrates gene regulatory networks into a variational autoencoder framework to enhance explainability and achieves state-of-the-art performance in predicting cellular responses to genetic perturbations while generating biologically meaningful regulatory networks.

Authors:Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng
Title: LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
Abstract:
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model by generating image-level detailed captions for each image can further improve performance. To achieve the goal, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
中文摘要:LLMDet模型通过与大语言模型协同训练,利用GroundingCap-1M数据集生成图像级描述,显著提升了开放词汇检测性能,并能构建更强大的多模态系统。
English Summary: The LLMDet model enhances open-vocabulary detection by co-training with a large language model using the GroundingCap-1M dataset, achieving superior performance and enabling stronger multimodal applications.

Authors:Anh Bui, Trang Vu, Long Vuong, Trung Le, Paul Montague, Tamas Abraham, Junae Kim, Dinh Phung
Title: Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them
Abstract:
Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. The common principle of previous works to remove a specific concept is to map it to a fixed generic concept, such as a neutral concept or just an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which dynamically selects optimal target concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance. Our code is published at https://github.com/tuananhbui89/Adaptive-Guided-Erasure.
Chinese: 本文提出自适应引导擦除方法(AGE),通过动态选择最优目标概念来消除扩散模型中的不良概念,有效减少对其他概念的副作用,并在实验中显著优于现有擦除技术。
English: This paper introduces Adaptive Guided Erasure (AGE), a method that dynamically selects optimal target concepts for erasing undesirable ones in diffusion models, minimizing side effects on unrelated concepts and outperforming existing techniques.

Authors:Minwoo Jung, Sangwoo Jung, Hyeonjae Gil, Ayoung Kim
Title: HeLiOS: Heterogeneous LiDAR Place Recognition via Overlap-based Learning and Local Spherical Transformer
Abstract:
LiDAR place recognition is a crucial module in localization that matches the current location with previously observed environments. Most existing approaches in LiDAR place recognition predominantly focus on the spinning type LiDAR to exploit its large FOV for matching. However, with the recent emergence of various LiDAR types, the importance of matching data across different LiDAR types has grown significantly, a challenge that has been largely overlooked for many years. To address these challenges, we introduce HeLiOS, a deep network tailored for heterogeneous LiDAR place recognition, which utilizes small local windows with spherical transformers and optimal transport-based cluster assignment for robust global descriptors. Our overlap-based data mining and guided-triplet loss overcome the limitations of traditional distance-based mining and discrete class constraints. HeLiOS is validated on public datasets, demonstrating performance in heterogeneous LiDAR place recognition while including an evaluation for long-term recognition, showcasing its ability to handle unseen LiDAR types. We release the HeLiOS code as open source for the robotics community at https://github.com/minwoo0611/HeLiOS.
中文摘要:HeLiOS是一种专为异构LiDAR位置识别设计的深度网络,通过球形变换器和最优传输构建鲁棒的全局描述符,在公开数据集上验证有效并已开源发布。
English Summary: HeLiOS is a deep network designed for heterogeneous LiDAR place recognition, using spherical transformers and optimal transport to create robust global descriptors, validated on public datasets and released as open source.
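
The abstract mentions optimal transport-based cluster assignment but does not spell out the procedure. As a rough illustration only, the sketch below uses generic Sinkhorn-Knopp normalization, a common way to compute a balanced soft assignment of features to clusters. The function name, the toy score matrix, and the balancing target are assumptions for illustration and are not taken from the HeLiOS code.

```python
import math


def sinkhorn_assign(scores, n_iters=50, eps=1.0):
    """Balanced soft assignment of rows (features) to columns (clusters).

    Generic Sinkhorn-Knopp sketch (an assumption, not the HeLiOS method):
    rows are scaled to sum to 1 and columns to sum to n / k, alternately.
    """
    n, k = len(scores), len(scores[0])
    # Gibbs kernel: higher similarity score -> larger unnormalized mass.
    plan = [[math.exp(s / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives the same total mass n / k.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(k)]
        plan = [[plan[i][j] * (n / k) / col_sums[j] for j in range(k)]
                for i in range(n)]
        # Row step: every feature distributes exactly one unit of mass.
        plan = [[v / sum(row) for v in row] for row in plan]
    return plan


# Hypothetical similarity scores of 4 local features against 2 clusters.
toy_scores = [[5.0, 0.0], [0.0, 5.0], [4.0, 1.0], [1.0, 4.0]]
assignment = sinkhorn_assign(toy_scores)
hard_labels = [row.index(max(row)) for row in assignment]
```

Each row of `assignment` is a probability distribution over clusters, and the column normalization discourages all features from collapsing onto a single cluster.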

Authors:Zi-Jian Cheng, Zi-Yi Jia, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Title: TabFSBench: Tabular Benchmark for Feature Shifts in Open Environments
Abstract:
Tabular data is widely utilized in various machine learning tasks. Current tabular learning research predominantly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution and feature shifts occur, leading to significant degradation in model performance. Previous research has primarily concentrated on mitigating distribution shifts, whereas feature shifts, a distinctive and unexplored challenge of tabular data, have garnered limited attention. To this end, this paper conducts the first comprehensive study on feature shifts in tabular data and introduces the first tabular feature-shift benchmark (TabFSBench). TabFSBench evaluates impacts of four distinct feature-shift scenarios on four tabular model categories across various datasets and assesses the performance of large language models (LLMs) and tabular LLMs in the tabular benchmark for the first time. Our study demonstrates three main observations: (1) most tabular models have limited applicability in feature-shift scenarios; (2) the shifted feature set importance has a linear relationship with model performance degradation; (3) model performance in closed environments correlates with feature-shift performance. Future research directions are also explored for each observation. Benchmark: https://github.com/LAMDASZ-ML/TabFSBench.
中文摘要:本文首次对表格数据中的特征偏移问题进行全面研究,提出TabFSBench基准评估不同特征偏移场景下的模型表现,揭示了模型适用性局限及性能相关性等关键发现。
English Summary: This paper presents the first comprehensive study on feature shifts in tabular data, introducing TabFSBench to evaluate model performance across various feature-shift scenarios and revealing key findings about model limitations and performance correlations.

Authors:Wencheng Yang, Song Wang, Di Wu, Taotao Cai, Yanming Zhu, Shicheng Wei, Yiying Zhang, Xu Yang, Zhaohui Tang, Yan Li
Title: Deep Learning Model Inversion Attacks and Defenses: A Comprehensive Survey
Abstract:
The rapid adoption of deep learning in sensitive domains has brought tremendous benefits. However, this widespread adoption has also given rise to serious vulnerabilities, particularly model inversion (MI) attacks, posing a significant threat to the privacy and integrity of personal data. The increasing prevalence of these attacks in applications such as biometrics, healthcare, and finance has created an urgent need to understand their mechanisms, impacts, and defense methods. This survey aims to fill the gap in the literature by providing a structured and in-depth review of MI attacks and defense strategies. Our contributions include a systematic taxonomy of MI attacks, extensive research on attack techniques and defense mechanisms, and a discussion about the challenges and future research directions in this evolving field. By exploring the technical and ethical implications of MI attacks, this survey aims to offer insights into the impact of AI-powered systems on privacy, security, and trust. In conjunction with this survey, we have developed a comprehensive repository to support research on MI attacks and defenses. The repository includes state-of-the-art research papers, datasets, evaluation metrics, and other resources to meet the needs of both novice and experienced researchers interested in MI attacks and defenses, as well as the broader field of AI security and privacy. The repository will be continuously maintained to ensure its relevance and utility. It is accessible at https://github.com/overgter/Deep-Learning-Model-Inversion-Attacks-and-Defenses.
中文: 本综述系统梳理了深度学习中的模型反转攻击与防御策略,探讨其机制、影响及未来研究方向,并提供了持续更新的资源库以支持相关研究。
English: This survey provides a structured review of model inversion attacks and defense strategies in deep learning, addressing their mechanisms, impacts, and future research directions while offering a comprehensive repository for ongoing study.

Authors:Sunyong Seo, Huisu Yoon, Semin Kim, Jongha Lee
Title: Full-scale Representation Guided Network for Retinal Vessel Segmentation
Abstract:
The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full Scale Guided Network (FSG-Net), where the feature representation network with modernized convolution blocks extracts full-scale information and the guided convolution block refines that information. An attention-guided filter is introduced to the guided convolution block under the interpretation that the filter behaves like the unsharp mask filter. Passing full-scale information to the attention block allows for the generation of improved attention maps, which are then passed to the attention-guided filter, resulting in performance enhancement of the segmentation network. The structure preceding the guided convolution block can be replaced by any U-Net variant, which enhances the scalability of the proposed approach. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments also show that the proposed network demonstrates competitive results compared to current SOTA models on various public datasets. Ablation studies demonstrate that the proposed model is competitive with much smaller parameter sizes. Lastly, by applying the proposed model to facial wrinkle segmentation, we confirmed the potential for scalability to similar tasks in other domains. Our code is available at https://github.com/ZombaSY/FSG-Net-pytorch.
中文: 本研究提出FSG-Net网络,通过利用全尺度信息和注意力引导滤波器来改进视网膜血管分割,在参数更少的情况下实现了与先进模型相媲美的性能,并展现出在面部皱纹分割等其他领域的可扩展性。
English: The study introduces FSG-Net, a novel network that enhances retinal vessel segmentation by utilizing full-scale information and attention-guided filters, achieving competitive performance with fewer parameters and demonstrating scalability to other domains like facial wrinkle segmentation.

Authors:Tongda Xu, Xiyan Cai, Xinjie Zhang, Xingtong Ge, Dailan He, Ming Sun, Jingjing Liu, Ya-Qin Zhang, Jian Li, Yan Wang
Title: Rethinking Diffusion Posterior Sampling: From Conditional Score Estimator to Maximizing a Posterior
Abstract:
Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. In this paper, however, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed, but rather aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512x512 ImageNet images, revealing that: 1) DPS's conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) the mean of DPS's conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a lightweight conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS's performance. The source code for these improvements is provided at https://github.com/tongdaxu/Rethinking-Diffusion-Posterior-Sampling-From-Conditional-Score-Estimator-to-Maximizing-a-Posterior.
中文: 本研究发现扩散后验采样(DPS)更接近最大化后验而非精确逼近条件得分,并提出通过显式后验最大化与轻量级条件得分估计器的改进方案,显著提升了其性能表现。
English: This study reveals that Diffusion Posterior Sampling (DPS) aligns more closely with maximizing a posterior rather than accurately approximating the conditional score, and proposes enhancements including explicit posterior maximization and a lightweight conditional score estimator to significantly improve its performance.

Authors:Jaesin Ahn, Heechul Jung
Title: Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models
Abstract:
Text-to-image diffusion models show remarkable generation performance following text prompts, but risk generating Not Safe For Work (NSFW) content from unsafe prompts. Existing approaches, such as prompt filtering or concept unlearning, fail to defend against adversarial attacks while maintaining benign image quality. In this paper, we propose a novel approach called Distorting Embedding Space (DES), a text encoder-based defense mechanism that effectively tackles these issues through innovative embedding space control. DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions to prevent unsafe content generation, while reproducing the original safe embeddings. DES also neutralizes the nudity embedding, extracted using the prompt "nudity", by aligning it with a neutral embedding to enhance robustness against adversarial attacks. These methods ensure both robust defense and high-quality image generation. Additionally, DES can be adopted in a plug-and-play manner and requires zero inference overhead, facilitating its deployment. Extensive experiments on diverse attack types, including black-box and white-box scenarios, demonstrate DES's state-of-the-art performance in both defense capability and benign image generation quality. Our model is available at https://github.com/aei13/DES.
中文:提出的扭曲嵌入空间(DES)方法通过将不安全文本嵌入转换为安全区域,有效防止生成不良内容,同时保持高质量良性图像输出并增强对抗攻击的防御能力。
English: The proposed Distorting Embedding Space (DES) method effectively prevents NSFW content generation by transforming unsafe text embeddings into safe regions while maintaining high-quality benign image output and robust defense against adversarial attacks.

Authors:Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
Title: Memory-Efficient Fine-Tuning of Transformers via Token Selection
Abstract:
Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at https://github.com/facebookresearch/tokentune.
Chinese: TokenTune是一种针对Transformer模型的高效微调方法,通过在反向传播中仅处理部分输入标记来减少激活内存占用,在保持与全参数微调相当性能的同时显著降低了内存消耗。
English: TokenTune is a memory-efficient fine-tuning method for transformer models that reduces activation memory by processing only a subset of tokens during backward passes, achieving comparable performance to full fine-tuning while significantly lowering memory usage.

Authors:Ervin Dervishaj, Tuukka Ruotsalo, Maria Maistro, Christina Lioma
Title: Are Representation Disentanglement and Interpretability Linked in Recommendation Models? A Critical Review and Reproducibility Study
Abstract:
Unsupervised learning of disentangled representations has been closely tied to enhancing the representation interpretability of Recommender Systems (RSs). This has been achieved by making the representation of individual features more distinctly separated, so that it is easier to attribute the contribution of features to the model's predictions. However, such advantages in interpretability and feature attribution have mainly been explored qualitatively. Moreover, the effect of disentanglement on the model's recommendation performance has been largely overlooked. In this work, we reproduce the recommendation performance, representation disentanglement and representation interpretability of five well-known recommendation models on four RS datasets. We quantify disentanglement and investigate the link of disentanglement with recommendation effectiveness and representation interpretability. While several existing works in RSs have proposed disentangled representations as a gateway to improved effectiveness and interpretability, our findings show that disentanglement is not necessarily related to effectiveness but is closely related to representation interpretability. Our code and results are publicly available at https://github.com/edervishaj/disentanglement-interpretability-recsys.
Chinese: 研究表明,推荐系统中的解耦表示虽能提升表征可解释性,但与推荐效果并无必然关联,这一结论通过对五个模型在四个数据集上的量化分析得到验证。
English: Disentangled representations in recommender systems are found to enhance interpretability but do not necessarily improve recommendation effectiveness, as demonstrated through quantitative analysis of five models on four datasets.

Authors:Zehong Wang, Zheyuan Zhang, Tianyi Ma, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye
Title: Beyond Message Passing: Neural Graph Pattern Machine
Abstract:
Graph learning tasks often hinge on identifying key substructure patterns -- such as triadic closures in social networks or benzene rings in molecular graphs -- that underpin downstream performance. However, most existing graph neural networks (GNNs) rely on message passing, which aggregates local neighborhood information iteratively and struggles to explicitly capture such fundamental motifs, like triangles, k-cliques, and rings. This limitation hinders both expressiveness and long-range dependency modeling. In this paper, we introduce the Neural Graph Pattern Machine (GPM), a novel framework that bypasses message passing by learning directly from graph substructures. GPM efficiently extracts, encodes, and prioritizes task-relevant graph patterns, offering greater expressivity and improved ability to capture long-range dependencies. Empirical evaluations across four standard tasks -- node classification, link prediction, graph classification, and graph regression -- demonstrate that GPM outperforms state-of-the-art baselines. Further analysis reveals that GPM exhibits strong out-of-distribution generalization, desirable scalability, and enhanced interpretability. Code and datasets are available at: https://github.com/Zehong-Wang/GPM.
中文摘要:神经图模式机(GPM)框架通过直接从图子结构学习,克服了传统图神经网络的局限性,在多项任务中实现更优性能,同时展现出更强的表达能力和可解释性。
English Summary: The Neural Graph Pattern Machine (GPM) framework overcomes limitations of traditional graph neural networks by directly learning from key graph substructures, achieving superior performance across multiple tasks while demonstrating enhanced expressivity and interpretability.

Authors:Zhe Wang, Yuhua Ru, Fabian Bauer, Aladine Chetouani, Fang Chen, Liping Zhang, Didier Hans, Rachid Jennane, Mohamed Jarraya, Yung Hsin Chen
Title: Distillation-Driven Diffusion Model for Multi-Scale MRI Super-Resolution: Make 1.5T MRI Great Again
Abstract:
Magnetic Resonance Imaging (MRI) offers critical insights into microstructural details; however, the spatial resolution of standard 1.5T imaging systems is often limited. In contrast, 7T MRI provides significantly enhanced spatial resolution, enabling finer visualization of anatomical structures. Despite this, the high cost and limited availability of 7T MRI hinder its widespread use in clinical settings. To address this challenge, a novel Super-Resolution (SR) model is proposed to generate 7T-like MRI from standard 1.5T MRI scans. Our approach leverages a diffusion-based architecture, incorporating gradient nonlinearity correction and bias field correction data from 7T imaging as guidance. Moreover, to improve deployability, a progressive distillation strategy is introduced. Specifically, the student model refines the 7T SR task in steps, leveraging feature maps from the inference phase of the teacher model as guidance, aiming to allow the student model to progressively achieve 7T SR performance with a smaller, deployable model size. Experimental results demonstrate that our baseline teacher model achieves state-of-the-art SR performance. The student model, while lightweight, sacrifices minimal performance. Furthermore, the student model is capable of accepting MRI inputs at varying resolutions without the need for retraining, further enhancing deployment flexibility. The clinical relevance of our proposed method is validated using clinical data from Massachusetts General Hospital. Our code is available at https://github.com/ZWang78/SR.
中文: 本文提出了一种基于扩散模型的新型超分辨率方法,可将标准1.5T磁共振图像增强至类似7T的高分辨率,其轻量化学生模型在保持优异性能的同时,无需重新训练即可适应不同分辨率的输入,显著提升了临床部署灵活性。
English: A novel diffusion-based super-resolution model is proposed to generate high-resolution 7T-like MRI from standard 1.5T scans, achieving state-of-the-art performance with a lightweight student model that maintains flexibility across varying input resolutions without retraining.

Authors:Harshwardhan Praveen, Jacob Brown, Christopher Earls
Title: chebgreen: Learning and Interpolating Continuous Empirical Green's Functions from Data
Abstract:
In this work, we present a mesh-independent, data-driven library, chebgreen, to mathematically model one-dimensional systems, possessing an associated control parameter, and whose governing partial differential equation is unknown. The proposed method learns an Empirical Green's Function for the associated, but hidden, boundary value problem, in the form of a Rational Neural Network from which we subsequently construct a bivariate representation in a Chebyshev basis. We uncover the Green's function, at an unseen control parameter value, by interpolating the left and right singular functions within a suitable library, expressed as points on a manifold of Quasimatrices, while the associated singular values are interpolated with Lagrange polynomials.
中文摘要:本研究提出了chebgreen这一与网格无关的数据驱动库,通过有理神经网络和切比雪夫基表示,从数据中学习经验格林函数来建模控制方程未知的一维系统。
English Summary: The study introduces chebgreen, a mesh-independent data-driven library that models one-dimensional systems with unknown governing equations by learning Empirical Green's Functions through Rational Neural Networks and Chebyshev basis representations.
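
The abstract states that the singular values are interpolated across control-parameter values with Lagrange polynomials. The sketch below shows plain Lagrange interpolation of a single scalar quantity. The function name and the toy data are hypothetical and are not drawn from the chebgreen library, which also interpolates singular functions on a manifold of quasimatrices (not shown here).

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial through the points (xs[i], ys[i])."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if len(set(xs)) != len(xs):
        raise ValueError("interpolation nodes must be distinct")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # i-th Lagrange basis polynomial: 1 at xs[i], 0 at every other node.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Hypothetical library: one singular value sampled at three control parameters.
thetas = [1.0, 2.0, 3.0]
sigmas = [1.0, 4.0, 9.0]  # toy values following sigma = theta**2

# Estimate the singular value at an unseen control-parameter value.
sigma_unseen = lagrange_interpolate(thetas, sigmas, 2.5)
```

Because the toy values lie on a quadratic and three nodes are used, the interpolant reproduces them exactly; with real data the result is only an approximation whose quality depends on node placement.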

Authors:Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee
Title: Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey
Abstract:
In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodality, for data augmentation to enhance generalization, and combat overfitting in training deep convolutional neural networks. However, while existing surveys predominantly focus on ML and DL techniques or limited modalities (text or images), a gap remains in addressing the latest advancements and multi-modal applications of LLM-based methods. This survey fills that gap by exploring recent literature utilizing multimodal LLMs to augment image, text, and audio data, offering a comprehensive understanding of these processes. We outlined various methods employed in the LLM-based image, text and speech augmentation, and discussed the limitations identified in current approaches. Additionally, we identified potential solutions to these limitations from the literature to enhance the efficacy of data augmentation practices using multimodal LLMs. This survey serves as a foundation for future research, aiming to refine and expand the use of multimodal LLMs in enhancing dataset quality and diversity for deep learning applications. (Surveyed Paper GitHub Repo: https://github.com/WSUAgRobotics/data-aug-multi-modal-llm. Keywords: LLM data augmentation, Grok text data augmentation, DeepSeek image data augmentation, Grok speech data augmentation, GPT audio augmentation, voice augmentation, DeepSeek for data augmentation, DeepSeek R1 text data augmentation, DeepSeek R1 image augmentation, Image Augmentation using LLM, Text Augmentation using LLM, LLM data augmentation for deep learning applications)
中文摘要:本综述填补了现有研究的空白,全面探讨了多模态大语言模型在图像、文本和音频数据增强方面的最新进展,旨在提升深度学习应用的数据质量和多样性。
English Summary: This survey addresses the gap in existing literature by comprehensively exploring recent advancements in multimodal Large Language Models for augmenting image, text, and audio data to enhance deep learning applications.

Authors:Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi
Title: Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
Abstract:
As large language models (LLMs) become increasingly prevalent, ensuring their robustness against adversarial misuse is crucial. This paper introduces the GAP (Graph of Attacks with Pruning) framework, an advanced approach for generating stealthy jailbreak prompts to evaluate and enhance LLM safeguards. GAP addresses limitations in existing tree-based LLM jailbreak methods by implementing an interconnected graph structure that enables knowledge sharing across attack paths. Our experimental evaluation demonstrates GAP's superiority over existing techniques, achieving a 20.8% increase in attack success rates while reducing query costs by 62.7%. GAP consistently outperforms state-of-the-art methods for attacking both open and closed LLMs, with attack success rates of >96%. Additionally, we present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks. GAP-generated prompts prove highly effective in improving content moderation systems, increasing true positive detection rates by 108.5% and accuracy by 183.6% when used for fine-tuning. Our implementation is available at https://github.com/dsbuddy/GAP-LLM-Safety.
中文: 本文提出的GAP框架采用互联图结构生成隐蔽的越狱提示,在将查询成本降低62.7%的同时使攻击成功率提升20.8%,用于微调时还能将内容审核系统的检测准确率提升183.6%。
English: This paper presents the GAP framework, which uses an interconnected graph structure to generate stealthy jailbreak prompts, significantly improving attack success rates by 20.8% while reducing query costs by 62.7% and enhancing content moderation systems when used for fine-tuning.

Authors:Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang
Title: SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
Abstract:
The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily manually, for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.
中文摘要:SafeRAG基准测试评估RAG系统对四类攻击的脆弱性,发现即使基础攻击也能绕过现有防御机制导致服务质量下降。
English Summary: The SafeRAG benchmark evaluates RAG system vulnerabilities to four types of attacks, revealing that even basic attacks can bypass existing defenses and degrade service quality.

Authors:Xiangbo Gao, Runsheng Xu, Jiachen Li, Ziran Wang, Zhiwen Fan, Zhengzhong Tu
Title: STAMP: Scalable Task And Model-agnostic Collaborative Perception
Abstract:
Perception is crucial for autonomous driving, but single-agent perception is often constrained by sensors' physical limitations, leading to degraded performance under severe occlusion, adverse weather conditions, and when detecting distant objects. Multi-agent collaborative perception offers a solution, yet challenges arise when integrating heterogeneous agents with varying model architectures. To address these challenges, we propose STAMP, a scalable task- and model-agnostic, collaborative perception pipeline for heterogeneous agents. STAMP utilizes lightweight adapter-reverter pairs to transform Bird's Eye View (BEV) features between agent-specific and shared protocol domains, enabling efficient feature sharing and fusion. This approach minimizes computational overhead, enhances scalability, and preserves model security. Experiments on simulated and real-world datasets demonstrate STAMP's comparable or superior accuracy to state-of-the-art models with significantly reduced computational costs. As a first-of-its-kind task- and model-agnostic framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy. Our project page is at https://xiangbogaobarry.github.io/STAMP and the code is available at https://github.com/taco-group/STAMP.
中文: STAMP是一种可扩展、任务与模型无关的异构智能体协同感知框架,通过轻量级适配器-还原器对实现高效特征共享与融合,在显著降低计算成本的同时保持高精度和模型安全性。
English: STAMP is a scalable, task- and model-agnostic collaborative perception pipeline for heterogeneous agents that uses lightweight adapter-reverter pairs to enable efficient feature sharing and fusion, achieving high accuracy with low computational costs and enhanced security.

Authors:Vishal Thengane, Xiatian Zhu, Salim Bouzerdoum, Son Lam Phung, Yunpeng Li
Title: Foundational Models for 3D Point Clouds: A Survey and Outlook
Abstract:
The 3D point cloud representation plays a crucial role in preserving the geometric fidelity of the physical world, enabling more accurate complex 3D environments. While humans naturally comprehend the intricate relationships between objects and variations through a multisensory system, artificial intelligence (AI) systems have yet to fully replicate this capacity. To bridge this gap, it becomes essential to incorporate multiple modalities. Models that can seamlessly integrate and reason across these modalities are known as foundation models (FMs). The development of FMs for 2D modalities, such as images and text, has seen significant progress, driven by the abundant availability of large-scale datasets. However, the 3D domain has lagged due to the scarcity of labelled data and high computational overheads. In response, recent research has begun to explore the potential of applying FMs to 3D tasks, overcoming these challenges by leveraging existing 2D knowledge. Additionally, language, with its capacity for abstract reasoning and description of the environment, offers a promising avenue for enhancing 3D understanding through large pre-trained language models (LLMs). Despite the rapid development and adoption of FMs for 3D vision tasks in recent years, there remains a gap in comprehensive and in-depth literature reviews. This article aims to address this gap by presenting a comprehensive overview of the state-of-the-art methods that utilize FMs for 3D visual understanding. We start by reviewing various strategies employed in building 3D FMs. Then we categorize and summarize the use of different FMs for tasks such as perception tasks. Finally, the article offers insights into future directions for research and development in this field. To help readers, we have curated a list of relevant papers on the topic: https://github.com/vgthengane/Awesome-FMs-in-3D.
中文摘要:本文全面综述了用于三维视觉理解的基础模型,通过分析其构建策略、在感知任务中的应用及未来研究方向,填补了该领域文献综述的空白。
English Summary: This article provides a comprehensive review of foundation models (FMs) for 3D visual understanding, addressing the gap in literature by examining their development strategies, applications in perception tasks, and future research directions.

Authors:Hao Dong, Moru Liu, Kaiyang Zhou, Eleni Chatzi, Juho Kannala, Cyrill Stachniss, Olga Fink
Title: Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models
Abstract:
In real-world scenarios, achieving domain adaptation and generalization poses significant challenges, as models must adapt to or generalize across unknown target distributions. Extending these capabilities to unseen multimodal distributions, i.e., multimodal domain adaptation and generalization, is even more challenging due to the distinct characteristics of different modalities. Significant progress has been made over the years, with applications ranging from action recognition to semantic segmentation. Besides, the recent advent of large-scale pre-trained multimodal foundation models, such as CLIP, has inspired works leveraging these models to enhance adaptation and generalization performances or adapting them to downstream tasks. This survey provides the first comprehensive review of recent advances from traditional approaches to foundation models, covering: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models. For each topic, we formally define the problem and thoroughly review existing methods. Additionally, we analyze relevant datasets and applications, highlighting open challenges and potential future research directions. We maintain an active repository that contains up-to-date literature at https://github.com/donghao51/Awesome-Multimodal-Adaptation.
中文: 本综述全面回顾了多模态领域自适应与泛化的研究进展,涵盖从传统方法到基础模型的各类方法,并分析了关键挑战、应用场景及未来研究方向。
English: This survey comprehensively reviews multimodal domain adaptation and generalization, covering traditional methods to foundation models, analyzing key challenges, applications, and future research directions.

Authors:Matthieu Barreau, Haoming Shen
Title: Accuracy and Robustness of Weight-Balancing Methods for Training PINNs
Abstract:
Physics-Informed Neural Networks (PINNs) have emerged as powerful tools for integrating physics-based models with data by minimizing both data and physics losses. However, this multi-objective optimization problem is notoriously challenging, with some benchmark problems leading to unfeasible solutions. To address these issues, various strategies have been proposed, including adaptive weight adjustments in the loss function. In this work, we introduce clear definitions of accuracy and robustness in the context of PINNs and propose a novel training algorithm based on the Primal-Dual (PD) optimization framework. Our approach enhances the robustness of PINNs while maintaining comparable performance to existing weight-balancing methods. Numerical experiments demonstrate that the PD method consistently achieves reliable solutions across all investigated cases, even in the low-data regime, and can be easily implemented, facilitating its practical adoption. The code is available at https://github.com/haoming-SHEN/Accuracy-and-Robustness-of-Weight-Balancing-Methods-for-Training-PINNs.git.
中文: 本文提出了一种基于原始-对偶优化框架的新训练算法,增强了物理信息神经网络的鲁棒性,在保持与现有权重平衡方法相当性能的同时,实现了所有测试场景下的可靠求解。
English: This paper introduces a Primal-Dual optimization framework to enhance the robustness of Physics-Informed Neural Networks (PINNs), achieving reliable performance across various scenarios while maintaining accuracy comparable to existing methods.

Authors:Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal
Title: Differentially Private Steering for Large Language Model Alignment
Abstract:
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
中文摘要:本研究提出具有差分隐私保证的私有引导对齐算法(PSA),通过激活编辑在保护私有数据集的同时对齐大语言模型,在七个基准测试中以最小性能损失实现有效对齐。
English Summary: This study introduces the Private Steering for LLM Alignment (PSA) algorithm, which uses differentially private activation editing to align large language models with private datasets while minimizing performance loss across seven benchmarks.

Authors:Benjamin Feuer, Chinmay Hegde
Title: WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Abstract:
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
中文:WILDCHAT-50M作为最大公开聊天数据集的推出,支持对语言模型后训练技术进行广泛比较分析,并证明新SFT混合方法能以更少样本实现更优性能。
English: The introduction of WILDCHAT-50M, the largest public chat dataset, enables extensive comparative analysis of language model post-training techniques and demonstrates superior performance with a new SFT mixture using significantly fewer samples.

Authors:Shi Chen, Lefei Zhang, Liangpei Zhang
Title: HSRMamba: Contextual Spatial-Spectral State Space Model for Single Image Hyperspectral Super-Resolution
Abstract:
Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code is available at: https://github.com/Tomchenshi/HSRMamba.
中文摘要:提出的HSRMamba模型通过局部空谱分区和全局谱重排序机制,解决了Mamba在超分辨率任务中忽略局部结构关系和对输入顺序敏感的问题,显著提升了图像重建质量。
English Summary: The proposed HSRMamba model overcomes Mamba's limitations in hyperspectral image super-resolution by implementing local spatial-spectral partitioning and global spectral reordering to better preserve structural relationships and enhance detail restoration.

Authors:Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
Title: GuardReasoner: Towards Reasoning-based LLM Safeguards
Abstract:
As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, by guiding the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks of 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% F1 score on average. We release the training data, code, and models with different scales (1B, 3B, 8B) of GuardReasoner : https://github.com/yueliu1999/GuardReasoner/.
Chinese: 本文提出GuardReasoner,一种基于推理的LLM安全防护方法,通过专门训练显著提升模型性能与可解释性,在多项基准测试中表现卓越。
English: This paper introduces GuardReasoner, a reasoning-based safeguard for LLMs that enhances safety through specialized training, achieving superior performance and explainability across multiple benchmarks.

Authors:Shiho Noda, Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
Title: A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models
Abstract:
Out-of-distribution (OOD) detection is a task that detects OOD samples during inference to ensure the safety of deployed models. However, conventional benchmarks have reached performance saturation, making it difficult to compare recent OOD detection methods. To address this challenge, we introduce three novel OOD detection benchmarks that enable a deeper understanding of method characteristics and reflect real-world conditions. First, we present ImageNet-X, designed to evaluate performance under challenging semantic shifts. Second, we propose ImageNet-FS-X for full-spectrum OOD detection, assessing robustness to covariate shifts (feature distribution shifts). Finally, we propose Wilds-FS-X, which extends these evaluations to real-world datasets, offering a more comprehensive testbed. Our experiments reveal that recent CLIP-based OOD detection methods struggle to varying degrees across the three proposed benchmarks, and none of them consistently outperforms the others. We hope the community goes beyond specific benchmarks and includes more challenging conditions reflecting real-world scenarios. The code is available at https://github.com/hoshi23/OOD-X-Benchmarks.
中文: 本文提出了三个新的分布外检测基准,以解决传统测试中的性能饱和问题并更真实地反映实际条件,发现现有方法在这些挑战中表现均不稳定。
English: This paper introduces three new benchmarks for out-of-distribution detection to address performance saturation in conventional tests and better reflect real-world conditions, revealing that current methods struggle inconsistently across these challenges.

Authors:Amanturdieva Akmaral, Muhammad Hamza Zafar
Title: Efficient Transformer for High Resolution Image Motion Deblurring
Abstract:
This paper presents a comprehensive study and improvement of the Restormer architecture for high-resolution image motion deblurring. We introduce architectural modifications that reduce model complexity by 18.4% while maintaining or improving performance through optimized attention mechanisms. Our enhanced training pipeline incorporates additional transformations including color jitter, Gaussian blur, and perspective transforms to improve model robustness as well as a new frequency loss term. Extensive experiments on the RealBlur-R, RealBlur-J, and Ultra-High-Definition Motion blurred (UHDM) datasets demonstrate the effectiveness of our approach. The improved architecture shows better convergence behavior and reduced training time while maintaining competitive performance across challenging scenarios. We also provide detailed ablation studies analyzing the impact of our modifications on model behavior and performance. Our results suggest that thoughtful architectural simplification combined with enhanced training strategies can yield more efficient yet equally capable models for motion deblurring tasks. Code and Data Available at: https://github.com/hamzafer/image-deblurring
中文: 本研究通过优化注意力机制和增强训练流程,改进了Restormer架构用于图像运动去模糊,在降低18.4%复杂度的同时保持性能,并在多个数据集上验证了其高效性。
English: This study enhances the Restormer architecture for image motion deblurring, reducing model complexity by 18.4% through optimized attention mechanisms and improving training with additional transformations and a frequency loss term, while maintaining competitive performance across multiple datasets.

Authors:Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
Title: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Abstract:
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA
中文:MedXpertQA是一个包含4,460道跨17个专科医学难题的权威评测基准,通过严格筛选和多模态临床数据,专门评估超越传统问答的高级推理能力。
English: MedXpertQA is a challenging medical benchmark featuring 4,460 expert-level questions across 17 specialties, incorporating rigorous filtering and multimodal clinical data to evaluate advanced reasoning beyond traditional QA pairs.

Authors:Jinlu Wang, Yanfeng Sun, Jiapu Wang, Junbin Gao, Shaofan Wang, Jipeng Guo
Title: Contrastive Learning Meets Pseudo-label-assisted Mixup Augmentation: A Comprehensive Graph Representation Framework from Local to Global
Abstract:
Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness in various graph representation learning tasks. However, most existing GNNs focus primarily on capturing local information through explicit graph convolution, often neglecting global message-passing. This limitation hinders the establishment of a collaborative interaction between global and local information, which is crucial for comprehensively understanding graph data. To address these challenges, we propose a novel framework called Comprehensive Graph Representation Learning (ComGRL). ComGRL integrates local information into global information to derive powerful representations. It achieves this by implicitly smoothing local information through flexible graph contrastive learning, ensuring reliable representations for subsequent global exploration. Then ComGRL transfers the locally derived representations to a multi-head self-attention module, enhancing their discriminative ability by uncovering diverse and rich global correlations. To further optimize local information dynamically under the self-supervision of pseudo-labels, ComGRL employs a triple sampling strategy to construct mixed node pairs and applies reliable Mixup augmentation across attributes and structure for local contrastive learning. This approach broadens the receptive field and facilitates coordination between local and global representation learning, enabling them to reinforce each other. Experimental results across six widely used graph datasets demonstrate that ComGRL achieves excellent performance in node classification tasks. The code is available at https://github.com/JinluWang1002/ComGRL.
中文:提出的ComGRL框架通过图对比学习和多头自注意力机制,将局部信息融入全局上下文以增强图表示学习,在多个数据集的节点分类任务中取得了优异性能。
English: The proposed ComGRL framework enhances graph representation learning by integrating local information into global contexts through graph contrastive learning and multi-head self-attention, achieving superior performance in node classification tasks across multiple datasets.

Authors:Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Guangtao Zhai
Title: AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment
Abstract:
Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA-3k, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA-3k includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA-3k, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. The dataset and code are available at https://github.com/charlotte9524/AGAV-Rater.
中文: 本文提出了首个大规模AI生成音视频质量评估数据集AGAVQA-3k,并开发了多模态模型AGAV-Rater,该模型在评估和选择最佳音视频内容方面实现了最先进的性能表现。
English: This paper introduces AGAVQA-3k, the first large-scale dataset for assessing AI-generated audio-visual content quality, and proposes AGAV-Rater, a multimodal model that achieves state-of-the-art performance in evaluating and selecting optimal audio-visual outputs.

Authors:Haoxiong Liu, Jiacheng Sun, Zhenguo Li, Andrew C Yao
Title: ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis
Abstract:
The synergy between deep learning models and traditional automation tools, such as built-in tactics of the proof assistant and off-the-shelf automated theorem provers, plays a crucial role in developing robust and efficient neural theorem provers (NTPs). However, for proof synthesis with LLMs, previous work applies automation tools either only when explicitly invoked by the model or at a single granularity level, failing to fully exploit their power. To solve this issue, we propose ProofAug, a procedure that equips LLMs with automation methods at various granularities through fine-grained structure analysis of model-generated proof proposals. ProofAug also serves as a versatile plug-and-play module that seamlessly integrates with any tree-search algorithm, enabling our construction of an efficient recursive proving (ERP) module to further enhance performance. The superiority of our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant. Notably, by additionally employing a mixed prompting strategy, we achieve a cumulative pass rate of 66.0% after curation of the dataset (61.9% for the original version) with 2100 queries to the model per problem (in contrast, the previous SOTA in Isabelle, Subgoal-XL, only achieves 56.1% using 16384 queries per problem). We also implement a Lean 4 version of ProofAug that can improve the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4% on miniF2F-test. Our code is available at https://github.com/haoxiongliu/ProofAug.
中文摘要:本文提出ProofAug方法,通过多粒度自动化工具增强大语言模型在神经定理证明中的能力,在miniF2F基准测试中以更少计算查询量实现了最先进的性能表现。
English Summary: This paper introduces ProofAug, a method that enhances large language models with multi-granular automation tools for neural theorem proving, achieving state-of-the-art performance on the miniF2F benchmark with significantly fewer computational queries.

Authors:Amitay Sicherman, Kira Radinsky
Title: ReactEmbed: A Cross-Domain Framework for Protein-Molecule Representation Learning via Biochemical Reaction Networks
Abstract:
The challenge in computational biology and drug discovery lies in creating comprehensive representations of proteins and molecules that capture their intrinsic properties and interactions. Traditional methods often focus on unimodal data, such as protein sequences or molecular structures, limiting their ability to capture complex biochemical relationships. This work enhances these representations by integrating biochemical reactions encompassing interactions between molecules and proteins. By leveraging reaction data alongside pre-trained embeddings from state-of-the-art protein and molecule models, we develop ReactEmbed, a novel method that creates a unified embedding space through contrastive learning. We evaluate ReactEmbed across diverse tasks, including drug-target interaction, protein-protein interaction, protein property prediction, and molecular property prediction, consistently surpassing all current state-of-the-art models. Notably, we showcase ReactEmbed's practical utility through successful implementation in lipid nanoparticle-based drug delivery, enabling zero-shot prediction of blood-brain barrier permeability for protein-nanoparticle complexes. The code and comprehensive database of reaction pairs are available for open use at https://github.com/amitaysicherman/ReactEmbed.
中文: 本研究提出ReactEmbed方法,通过整合生化反应与预训练的蛋白质和分子嵌入来创建统一表征,在多项任务中均超越现有最优模型,并成功应用于药物递送领域展示了其实用价值。
English: This work introduces ReactEmbed, a novel method that integrates biochemical reactions with pre-trained protein and molecule embeddings to create a unified representation, consistently outperforming state-of-the-art models across various tasks and demonstrating practical utility in drug delivery applications.

Authors:David Mallasén, Pasquale Davide Schiavone, Alberto A. Del Barrio, Manuel Prieto-Matias, David Atienza
Title: Increasing the Energy-Efficiency of Wearables Using Low-Precision Posit Arithmetic with PHEE
Abstract:
Wearable biomedical devices are increasingly being used for continuous patient health monitoring, enabling real-time insights and extended data collection without the need for prolonged hospital stays. These devices must be energy efficient to minimize battery size, improve comfort, and reduce recharging intervals. This paper investigates the use of specialized low-precision arithmetic formats to enhance the energy efficiency of biomedical wearables. Specifically, we explore posit arithmetic, a floating-point-like representation, in two key applications: cough detection for chronic cough monitoring and R peak detection in ECG analysis. Simulations reveal that 16-bit posits can replace 32-bit IEEE 754 floating point numbers with minimal accuracy loss in cough detection. For R peak detection, posit arithmetic achieves satisfactory accuracy with as few as 10 or 8 bits, compared to the 16-bit requirement for floating-point formats. To further this exploration, we introduce PHEE, a modular and extensible architecture that integrates the Coprosit posit coprocessor within a RISC-V-based system. Using the X-HEEP framework, PHEE seamlessly incorporates posit arithmetic, demonstrating reduced hardware area and power consumption compared to a floating-point counterpart system. Post-synthesis results targeting 16nm TSMC technology show that the posit hardware targeting these biomedical applications can be 38% smaller and consume up to 54% less energy at the functional unit level, with no performance compromise. These findings establish the potential of low-precision posit arithmetic to significantly improve the energy efficiency of wearable biomedical devices.
中文: 可穿戴生物医学设备通过采用低精度Posit算法,能在咳嗽检测和心电图分析等应用中显著提升能效,硬件尺寸最多缩小38%,能耗降低54%,且不影响性能。
English: Wearable biomedical devices can greatly enhance energy efficiency by adopting low-precision posit arithmetic, achieving hardware size reductions of up to 38% and energy savings of up to 54% without compromising performance in applications like cough and ECG monitoring.

Authors:Qingxiang Liu, Chenghao Liu, Sheng Sun, Di Yao, Yuxuan Liang
Title: GDformer: Going Beyond Subsequence Isolation for Multivariate Time Series Anomaly Detection
Abstract:
Unsupervised anomaly detection of multivariate time series is a challenging task, given the requirement of deriving a compact detection criterion without access to the anomaly points. Existing methods are mainly based on reconstruction error or association divergence, both of which are confined to isolated subsequences with limited horizons and therefore hardly yield a unified series-level criterion. In this paper, we propose the Global Dictionary-enhanced Transformer (GDformer) with a renovated dictionary-based cross-attention mechanism to cultivate the global representations shared by all normal points in the entire series. Accordingly, the cross-attention maps reflect the correlation weights between each point and the global representations, which naturally leads to a representation-wise, similarity-based detection criterion. To foster a more compact detection boundary, prototypes are introduced to capture the distribution of normal point-global correlation weights. GDformer consistently achieves state-of-the-art unsupervised anomaly detection performance on five real-world benchmark datasets. Further experiments validate that the global dictionary transfers well across various datasets. The code is available at https://github.com/yuppielqx/GDformer.
中文摘要:本文提出GDformer方法,通过全局字典和交叉注意力机制构建统一检测标准,在多变量时间序列的无监督异常检测中实现了最优性能。
English Summary: The paper introduces GDformer, a novel unsupervised anomaly detection method for multivariate time series that uses a global dictionary and cross-attention mechanism to establish a unified detection criterion, achieving state-of-the-art performance across multiple datasets.

Authors:Jinyao Guo, Chengpeng Wang, Xiangzhe Xu, Zian Su, Xiangyu Zhang
Title: RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing
Abstract:
Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability. This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.
中文:RepoAudit是一种基于大语言模型的智能代理,通过分析数据流事实和验证潜在错误来自主审计代码仓库,在检测大量真实漏洞的同时实现了高精度和高效率。
English: RepoAudit is an LLM-based agent that autonomously audits code repositories by analyzing data-flow facts and validating potential bugs, achieving high precision and efficiency while detecting numerous real-world bugs.

Authors:HaeJin Lee, Shubhanshu Mishra, Apratim Mishra, Zhiwen You, Jinseok Kim, Jana Diesner
Title: Revisiting gender bias research in bibliometrics: Standardizing methodological variability using Scholarly Data Analysis (SoDA) Cards
Abstract:
Gender biases in scholarly metrics remain a persistent concern, despite numerous bibliometric studies exploring their presence and absence across productivity, impact, acknowledgment, and self-citations. However, methodological inconsistencies, particularly in author name disambiguation and gender identification, limit the reliability and comparability of these studies, potentially perpetuating misperceptions and hindering effective interventions. A review of 70 relevant publications over the past 12 years reveals a wide range of approaches, from name-based and manual searches to more algorithmic and gold-standard methods, with no clear consensus on best practices. This variability, compounded by challenges such as accurately disambiguating Asian names and managing unassigned gender labels, underscores the urgent need for standardized and robust methodologies. To address this critical gap, we propose the development and implementation of "Scholarly Data Analysis (SoDA) Cards." These cards will provide a structured framework for documenting and reporting key methodological choices in scholarly data analysis, including author name disambiguation and gender identification procedures. By promoting transparency and reproducibility, SoDA Cards will facilitate more accurate comparisons and aggregations of research findings, ultimately supporting evidence-informed policymaking and enabling the longitudinal tracking of analytical approaches in the study of gender and other social biases in academia.
中文摘要:学术指标中的性别偏见因作者消歧和性别识别方法不一致而持续存在,为此提出“学术数据分析卡”以标准化报告流程,提升研究的透明度和可比性。
English Summary: Gender bias in scholarly metrics persists due to inconsistent methodologies in author disambiguation and gender identification, prompting the proposal of "SoDA Cards" to standardize reporting and enhance research transparency and comparability.

Authors:Siyuan Jiang, Yihan Hu, Wenjie Li, Pengcheng Zeng
Title: DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification
Abstract:
Functional data - observations in the form of curves or trajectories - arise in diverse domains such as biomedical sensing, motion capture, and handwriting recognition. A core challenge in functional data analysis (FDA) is accounting for phase variability, where misaligned temporal patterns hinder accurate inference. We introduce DeepFRC, an end-to-end deep learning framework for joint functional registration and classification. Unlike conventional approaches that decouple alignment and prediction, DeepFRC integrates class-aware elastic warping and a learnable basis representation into a unified architecture. This design enables temporal alignment and dimensionality reduction to be jointly optimized with classification, improving both interpretability and accuracy. We establish the first theoretical connection between alignment quality and generalization error, and validate our model on synthetic and real-world benchmarks. DeepFRC consistently outperforms state-of-the-art methods, especially in scenarios with complex temporal misalignment. Code is available at: https://github.com/Drivergo-93589/DeepFRC.
Chinese: DeepFRC 是一个端到端的深度学习框架,通过整合类别感知的弹性扭曲和可学习基表示,联合优化功能配准与分类,在处理时间错位时提高了准确性和可解释性,并优于现有先进方法。
English: DeepFRC is an end-to-end deep learning framework that jointly optimizes functional registration and classification by integrating class-aware elastic warping and a learnable basis representation, improving accuracy and interpretability while outperforming state-of-the-art methods in handling temporal misalignment.

Authors:Siyuan Jiang, Yihan Hu, Wenjie Li, Pengcheng Zeng
Title: DeepFRC: An End-to-End Deep Learning Model for Functional Registration and Classification
Abstract:
Functional data, representing curves or trajectories, are ubiquitous in fields like biomedicine and motion analysis. A fundamental challenge is phase variability -- temporal misalignments that obscure underlying patterns and degrade model performance. Current methods often address registration (alignment) and classification as separate, sequential tasks. This paper introduces DeepFRC, an end-to-end deep learning framework that jointly learns diffeomorphic warping functions and a classifier within a unified architecture. DeepFRC combines a neural deformation operator for elastic alignment, a spectral representation using Fourier basis for smooth functional embedding, and a class-aware contrastive loss that promotes both intra-class coherence and inter-class separation. We provide the first theoretical guarantees for such a joint model, proving its ability to approximate optimal warpings and establishing a data-dependent generalization bound that formally links registration fidelity to classification performance. Extensive experiments on synthetic and real-world datasets demonstrate that DeepFRC consistently outperforms state-of-the-art methods in both alignment quality and classification accuracy, while ablation studies validate the synergy of its components. DeepFRC also shows notable robustness to noise, missing data, and varying dataset scales. Code is available at https://github.com/Drivergo-93589/DeepFRC.
Chinese: DeepFRC 是一个端到端的深度学习框架,通过整合类别感知的弹性扭曲和可学习基表示,联合优化功能配准与分类,在处理时间错位时提高了准确性和可解释性,并优于现有先进方法。
English: DeepFRC is an end-to-end deep learning framework that jointly optimizes functional registration and classification by integrating class-aware elastic warping and a learnable basis representation, improving accuracy and interpretability while outperforming state-of-the-art methods in handling temporal misalignment.

Authors:Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Title: Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Abstract:
Harmful fine-tuning attacks introduce significant security risks to fine-tuning services. Mainstream defenses aim to vaccinate the model such that a later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model can still learn the harmful knowledge. To this end, we conduct further experiments and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model -- can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains the model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code is available at https://github.com/w-yibo/Panacea
中文: 研究发现现有针对有害微调攻击的防御措施效果不佳,但简单的随机扰动方法可降低风险,尽管会牺牲微调性能,因此提出Panacea方案,通过自适应扰动在保持安全的同时不影响性能。
English: The study reveals that current defenses against harmful fine-tuning attacks are ineffective, but a simple random perturbation method can mitigate risks, though it compromises fine-tuning performance, leading to the development of Panacea, which uses adaptive perturbations to maintain safety without performance loss.

Authors:Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Title: LLMs can see and hear without any training
Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM. Leveraging the LLM's innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
中文: MILS是一种无需训练的方法,通过迭代生成和评分候选答案来增强LLM的多模态能力,在图像、视频和音频描述等任务中实现了最先进的性能。
English: MILS is a training-free method that enhances multimodal capabilities in LLMs through iterative candidate generation and scoring, achieving state-of-the-art performance in tasks like captioning and media generation.
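
The generate, score, and feed-back loop described in the abstract can be illustrated with a small sketch. This is not the MILS implementation: `toy_generate` stands in for the LLM, `toy_score` stands in for a multimodal scorer, and the vocabulary, target words, and the `steps` and `keep` parameters are all hypothetical choices made for illustration.

```python
VOCAB = ["a", "dog", "cat", "running", "in", "park"]


def toy_generate(feedback):
    # Stand-in for the LLM (assumption): extend each kept candidate by one word.
    seeds = [text for _, text in feedback] or [""]
    return [(s + " " + w).strip() for s in seeds for w in VOCAB]


def toy_score(text):
    # Stand-in for the scorer (assumption): count distinct target words present.
    target = {"dog", "running", "park"}
    return len(target & set(text.split()))


def iterative_solve(generate, score, steps=3, keep=2):
    """Gradient-free loop: propose candidates, score them, feed the best back."""
    feedback = []  # list of (score, candidate) pairs, best first
    for _ in range(steps):
        # New proposals plus the candidates kept from the previous round.
        candidates = generate(feedback) + [text for _, text in feedback]
        scored = sorted(((score(c), c) for c in set(candidates)), reverse=True)
        feedback = scored[:keep]
    return feedback[0]


best_score, best_text = iterative_solve(toy_generate, toy_score)
```

Only the scorer's numbers are fed back, so no gradients are needed; in this toy setting three rounds are enough for a candidate to cover all three target words.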

Authors:Da Chang, Yu Li, Ganzhao Yuan
Title: AlphaAdam: Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates
Abstract:
In the training of large language models (LLMs), updating parameters more efficiently and stably has always been an important challenge. Existing methods for efficient parameter updates usually reach performance comparable to full parameter updates through techniques such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLMs from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask strength strategy to ensure efficient optimization and theoretical convergence guarantees, which is also applicable to most momentum-based optimizers. Extensive experiments show that AlphaAdam outperforms state-of-the-art methods such as AdamW in terms of convergence speed and computational efficiency across tasks, including GPT-2 pre-training and fine-tuning of RoBERTa and Llama-7B. Our AlphaAdam implements an optimizer enhancement framework for LLMs through intra-layer asynchronous masked adaptive updates. Our code is available at https://github.com/MaeChd/AlphaAdam.
中文: AlphaAdam是一种通过层内异步掩码自适应更新来优化大语言模型训练的框架,相比AdamW等方法,在收敛速度和计算效率上表现更优。
English: AlphaAdam is an optimization framework that enhances large language model training by enabling intra-layer asynchronous masked adaptive updates, improving convergence speed and computational efficiency compared to methods like AdamW.

Authors:Akinori F. Ebihara, Taiki Miyagawa, Kazuyuki Sakurai, Hitoshi Imaoka
Title: Learning the Optimal Stopping for Early Classification within Finite Horizons via Sequential Probability Ratio Test
Abstract:
Time-sensitive machine learning benefits from Sequential Probability Ratio Test (SPRT), which provides an optimal stopping time for early classification of time series. However, in finite horizon scenarios, where input lengths are finite, determining the optimal stopping rule becomes computationally intensive due to the need for backward induction, limiting practical applicability. We thus introduce FIRMBOUND, an SPRT-based framework that efficiently estimates the solution to backward induction from training data, bridging the gap between optimal stopping theory and real-world deployment. It employs density ratio estimation and convex function learning to provide statistically consistent estimators for sufficient statistic and conditional expectation, both essential for solving backward induction; consequently, FIRMBOUND minimizes Bayes risk to reach optimality. Additionally, we present a faster alternative using Gaussian process regression, which significantly reduces training time while retaining low deployment overhead, albeit with potential compromise in statistical consistency. Experiments across independent and identically distributed (i.i.d.), non-i.i.d., binary, multiclass, synthetic, and real-world datasets show that FIRMBOUND achieves optimalities in the sense of Bayes risk and speed-accuracy tradeoff. Furthermore, it advances the tradeoff boundary toward optimality when possible and reduces decision-time variance, ensuring reliable decision-making. Code is publicly available at https://github.com/Akinori-F-Ebihara/FIRMBOUND
Chinese: FIRMBOUND是一种基于SPRT的高效框架,通过密度比估计和凸函数学习来估计反向归纳解,克服了有限时间序列分类中的计算瓶颈,在多种数据集上实现了最优贝叶斯风险并提升了速度-精度权衡性能。
English: FIRMBOUND is an efficient SPRT-based framework that overcomes computational limitations in finite horizon time series classification by estimating backward induction solutions through density ratio estimation and convex function learning, achieving optimal Bayes risk and improved speed-accuracy tradeoffs across diverse datasets.
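
For context, the classical SPRT that this line of work builds on can be sketched as follows. This is a minimal illustration of Wald's test with fixed thresholds, not FIRMBOUND's learned decision boundary; the function name `sprt`, the error rates `alpha` and `beta`, and the choice to return `None` when the finite horizon ends without a decision are assumptions made here.

```python
import math


def sprt(llr_increments, alpha=0.05, beta=0.05):
    """Wald's SPRT for two simple hypotheses.

    llr_increments: per-sample log-likelihood ratios log p1(x_t) / p0(x_t).
    Returns (decision, steps_used); decision is 1, 0, or None if the
    finite sequence ends before either threshold is crossed.
    """
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this
    total = 0.0
    for t, inc in enumerate(llr_increments, start=1):
        total += inc
        if total >= upper:
            return 1, t
        if total <= lower:
            return 0, t
    return None, len(llr_increments)


decided_h1 = sprt([1.0, 1.0, 1.0, 1.0])
decided_h0 = sprt([-1.0, -1.0, -1.0, -1.0])
undecided = sprt([0.1] * 5)
```

The undecided case is where the finite horizon matters: with a hard deadline, fixed thresholds are no longer optimal, which is the gap the abstract says backward induction is needed to close.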

Authors:Bartosz Cywiński, Kamil Deja
Title: SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Abstract:
Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concepts and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at https://github.com/cywinski/SAeUron.
中文:SAeUron是一种创新方法,利用稀疏自编码器精确移除文生图扩散模型中的不良概念,在安全性和可解释性上优于现有方法,同时保持模型性能。
English: SAeUron is a novel method that uses sparse autoencoders to precisely remove unwanted concepts in text-to-image diffusion models, outperforming existing approaches in safety and interpretability while maintaining model performance.

Authors:Lei Cheng, Siyang Cao
Title: TransRAD: Retentive Vision Transformer for Enhanced Radar Object Detection
Abstract:
Despite significant advancements in environment perception capabilities for autonomous driving and intelligent robotics, cameras and LiDARs remain notoriously unreliable in low-light conditions and adverse weather, which limits their effectiveness. Radar serves as a reliable and low-cost sensor that can effectively complement these limitations. However, radar-based object detection has been underexplored due to the inherent weaknesses of radar data, such as low resolution, high noise, and lack of visual information. In this paper, we present TransRAD, a novel 3D radar object detection model designed to address these challenges by leveraging the Retentive Vision Transformer (RMT) to more effectively learn features from information-dense radar Range-Azimuth-Doppler (RAD) data. Our approach leverages the Retentive Manhattan Self-Attention (MaSA) mechanism provided by RMT to incorporate explicit spatial priors, thereby enabling more accurate alignment with the spatial saliency characteristics of radar targets in RAD data and achieving precise 3D radar detection across Range-Azimuth-Doppler dimensions. Furthermore, we propose Location-Aware NMS to effectively mitigate the common issue of duplicate bounding boxes in deep radar object detection. The experimental results demonstrate that TransRAD outperforms state-of-the-art methods in both 2D and 3D radar detection tasks, achieving higher accuracy, faster inference speed, and reduced computational complexity. Code is available at https://github.com/radar-lab/TransRAD
Chinese: TransRAD是一种新型3D雷达目标检测模型,通过采用保留视觉变换器和位置感知非极大值抑制技术,有效克服了雷达数据的固有缺陷,在精度、速度和计算效率上均超越了现有最优方法。
English: TransRAD is a novel 3D radar object detection model that addresses the limitations of radar data by leveraging the Retentive Vision Transformer and introducing Location-Aware NMS, achieving superior performance in accuracy, speed, and efficiency compared to state-of-the-art methods.

Authors:Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
Title: Improving Your Model Ranking on Chatbot Arena by Vote Rigging
Abstract:
Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model $m_{t}$. We first introduce a straightforward target-only rigging strategy that focuses on new battles involving $m_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about $1\%$ of new battles will involve $m_{t}$. To overcome this, we propose omnipresent rigging strategies, which exploit the fact that, under Chatbot Arena's Elo rating mechanism, any new vote on a battle can influence the ranking of the target model $m_{t}$, even if $m_{t}$ is not directly involved in the battle. We conduct experiments on around $1.7$ million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at https://github.com/sail-sg/Rigging-ChatbotArena.
中文: Chatbot Arena的众包投票系统易受操纵策略影响,通过利用Elo评分机制,即使少量投票也能操控模型排名,凸显了加强防御的必要性。
English: Chatbot Arena's crowdsourced voting system is vulnerable to rigging strategies that can manipulate model rankings by exploiting the Elo rating mechanism, even with minimal vote interference, highlighting the need for stronger defenses.
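
The claim that a vote on a battle not involving the target can still move the target's rank follows from how Elo-style ratings work, and a toy example makes it concrete. This sketch uses the textbook online Elo update; the model names, starting ratings, and `k=32` are hypothetical, and the leaderboard's actual rating computation may differ.

```python
def expected_score(r_a, r_b):
    # Standard Elo expected score of a player rated r_a against one rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings, winner, loser, k=32.0):
    # Return a new rating table after one battle; only the two players change.
    new = dict(ratings)
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    new[winner] = ratings[winner] + gain
    new[loser] = ratings[loser] - gain
    return new


def rank_of(ratings, model):
    # 1-based position when models are sorted by rating, highest first.
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return ordered.index(model) + 1


# Hypothetical leaderboard: the target sits just below a rival.
ratings = {"target": 1000.0, "rival": 1005.0, "other": 900.0}
before = rank_of(ratings, "target")

# One vote in a battle that does not involve the target at all.
updated = elo_update(ratings, winner="other", loser="rival")
after = rank_of(updated, "target")
```

The target's rating is untouched, yet it moves from rank 2 to rank 1 because the rival lost points to a lower-rated model.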

Authors:Aude Vuilliomenet, Santiago Martínez Balvanera, Oisin Mac Aodha, Kate E. Jones, Duncan Wilson
Title: acoupi: An Open-Source Python Framework for Deploying Bioacoustic AI Models on Edge Devices
Abstract:
1. Passive acoustic monitoring (PAM) coupled with artificial intelligence (AI) is becoming an essential tool for biodiversity monitoring. Traditional PAM systems require manual data offloading and impose substantial demands on storage and computing infrastructure. The combination of on-device AI-based processing and network connectivity enables local data analysis and transmission of only relevant information, greatly reducing storage needs. However, programming these devices for robust operation is challenging, requiring expertise in embedded systems and software engineering. Despite the increase in AI-based models for bioacoustics, their full potential remains unrealized without accessible tools to deploy them on custom hardware and tailor device behaviour to specific monitoring goals. 2. To address this challenge, we develop acoupi, an open-source Python framework that simplifies the creation and deployment of smart bioacoustic devices. acoupi integrates audio recording, AI-based data processing, data management, and real-time wireless messaging into a unified and configurable framework. By modularising key elements of the bioacoustic monitoring workflow, acoupi allows users to easily customise, extend, or select specific components to fit their unique monitoring needs. 3. We demonstrate the flexibility of acoupi by integrating two bioacoustic classifiers: BirdNET, for the classification of bird species, and BatDetect2, for the classification of UK bat species. We test the reliability of acoupi over a month-long deployment of two acoupi-powered devices in a UK urban park. 4. acoupi can be deployed on low-cost hardware such as the Raspberry Pi and can be customised for various applications. acoupi's standardised framework and simplified tools facilitate the adoption of AI-powered PAM systems for researchers and conservationists. acoupi is on GitHub at https://github.com/acoupi/acoupi.
中文摘要:acoupi开源框架通过集成AI处理和无线通信技术,简化了智能生物声学监测设备的开发,使研究人员能够利用低成本硬件实现可定制的生物多样性监测方案。
English Summary: The acoupi framework simplifies the development of smart bioacoustic monitoring devices by integrating AI processing and wireless communication, enabling customizable and efficient biodiversity tracking using low-cost hardware.

Authors:Ajinkya Khoche, Qingwen Zhang, Laura Pereira Sanchez, Aron Asefaw, Sina Sharif Mansouri, Patric Jensfelt
Title: SSF: Sparse Long-Range Scene Flow for Autonomous Driving
Abstract:
Scene flow enables an understanding of the motion characteristics of the environment in the 3D world. It gains particular significance at long range, where object-based perception methods might fail due to sparse observations far away. Although significant advancements have been made in scene flow pipelines to handle large-scale point clouds, a gap remains in scalability with respect to range. We attribute this limitation to the common design choice of using dense feature grids, which scale quadratically with range. In this paper, we propose Sparse Scene Flow (SSF), a general pipeline for long-range scene flow, adopting a sparse convolution based backbone for feature extraction. This approach introduces a new challenge: a mismatch in size and ordering of sparse feature maps between time-sequential point scans. To address this, we propose a sparse feature fusion scheme that augments the feature maps with virtual voxels at missing locations. Additionally, we propose a range-wise metric that implicitly gives greater importance to faraway points. Our method, SSF, achieves state-of-the-art results on the Argoverse2 dataset, demonstrating strong performance in long-range scene flow estimation. Our code will be released at https://github.com/KTH-RPL/SSF.git.
Chinese Summary: 本文提出稀疏场景流(SSF)方法,通过采用稀疏卷积特征提取和特征融合方案解决远距离场景流估计的扩展性难题,在Argoverse2数据集上实现了最优性能。
English Summary: The paper introduces Sparse Scene Flow (SSF), a novel pipeline that overcomes scalability limitations in long-range 3D motion estimation by employing sparse convolutions and feature fusion, achieving state-of-the-art performance on the Argoverse2 dataset.

Authors:Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca
Title: 2SSP: A Two-Stage Framework for Structured Pruning of LLMs
Abstract:
We propose a novel Two-Stage framework for Structured Pruning (2SSP) for pruning Large Language Models (LLMs), which combines two different strategies of pruning, namely Width and Depth Pruning. The first stage (Width Pruning) removes entire neurons, hence their corresponding rows and columns, aiming to preserve the connectivity among the pruned structures in the intermediate state of the Feed-Forward Networks in each Transformer block. This is done based on an importance score measuring the impact of each neuron on the output magnitude. The second stage (Depth Pruning), instead, removes entire Attention submodules. This is done by applying an iterative process that removes the Attention with the minimum impact on a given metric of interest (in our case, perplexity). We also propose a novel mechanism to balance the sparsity rate of the two stages w.r.t. the desired global sparsity. We test 2SSP on four LLM families and three sparsity rates (25%, 37.5%, and 50%), measuring the resulting perplexity over three language modeling datasets as well as the performance over six downstream tasks. Our method consistently outperforms five state-of-the-art competitors over three language modeling and six downstream tasks, with an up to two-order-of-magnitude gain in terms of pruning time. The code is available at https://github.com/FabrizioSandri/2SSP.
中文: 我们提出了一种新颖的两阶段结构化剪枝框架(2SSP),通过结合宽度剪枝和深度剪枝来缩减大语言模型的规模,在保持性能的同时显著优于现有方法并大幅提升剪枝效率。
English: We introduce a novel Two-Stage Structured Pruning (2SSP) framework for Large Language Models that combines width and depth pruning to reduce model size while maintaining performance, outperforming existing methods with significant efficiency gains.

Authors:Keshav Bhandari, Geraint A. Wiggins, Simon Colton
Title: Yin-Yang: Developing Motifs With Long-Term Structure And Controllability
Abstract:
Transformer models have made great strides in generating symbolically represented music with local coherence. However, controlling the development of motifs in a structured way with global form remains an open research area. One reason for this challenge is the note-by-note autoregressive generation of such models, which lack the ability to correct themselves after deviations from the motif. In addition, their structural performance on datasets with shorter durations has not been studied in the literature. In this study, we propose Yin-Yang, a framework consisting of a phrase generator, phrase refiner, and phrase selector models for the development of motifs into melodies with long-term structure and controllability. The phrase refiner is trained on a novel corruption-refinement strategy which allows it to produce melodic and rhythmic variations of an original motif at generation time, thereby rectifying deviations of the phrase generator. We also introduce a new objective evaluation metric for quantifying how smoothly the motif manifests itself within the piece. Evaluation results show that our model achieves better performance compared to state-of-the-art transformer models while having the advantage of being controllable and making the generated musical structure semi-interpretable, paving the way for musical analysis. Our code and demo page can be found at https://github.com/keshavbhandari/yinyang.
Chinese: Yin-Yang框架通过短语生成、精炼和选择机制,改进了Transformer模型,实现了对主题发展的可控性和旋律的长期结构构建,在性能和可解释性上均优于现有方法。
English: The Yin-Yang framework enhances transformer models by enabling controlled motif development and long-term melodic structure through phrase generation, refinement, and selection, outperforming existing methods in both performance and interpretability.

Authors:Ahmed Sharshar, Yasser Attia, Mohammad Yaqub, Mohsen Guizani
Title: PulmoFusion: Advancing Pulmonary Health with Efficient Multi-Modal Fusion
Abstract:
Traditional remote spirometry lacks the precision required for effective pulmonary monitoring. We present a novel, non-invasive approach using multimodal predictive models that integrate RGB or thermal video data with patient metadata. Our method leverages energy-efficient Spiking Neural Networks (SNNs) for the regression of Peak Expiratory Flow (PEF) and classification of Forced Expiratory Volume (FEV1) and Forced Vital Capacity (FVC), using lightweight CNNs to overcome SNN limitations in regression tasks. Multimodal data integration is improved with a Multi-Head Attention Layer, and we employ K-Fold validation and ensemble learning to boost robustness. Using thermal data, our SNN models achieve 92% accuracy on a breathing-cycle basis and 99.5% patient-wise. PEF regression models attain Relative RMSEs of 0.11 (thermal) and 0.26 (RGB), with an MAE of 4.52% for FEV1/FVC predictions, establishing state-of-the-art performance. Code and dataset can be found on https://github.com/ahmed-sharshar/RespiroDynamics.git
中文: 本研究提出了一种新型无创多模态方法,利用脉冲神经网络结合热成像或RGB视频数据,实现了对肺功能关键指标的高精度预测,在呼吸监测领域达到了领先性能水平。
English: This study introduces a non-invasive multimodal approach using spiking neural networks and thermal or RGB video data to accurately monitor pulmonary function, achieving state-of-the-art performance in predicting key respiratory metrics with high precision.

Authors:Derui Wang, Kristen Moore, Diksha Goel, Minjune Kim, Gang Li, Yang Li, Robin Doss, Minhui Xue, Bo Li, Seyit Camtepe, Liming Zhu
Title: CAMP in the Odyssey: Provably Robust Reinforcement Learning with Certified Radius Maximization
Abstract:
Deep reinforcement learning (DRL) has gained widespread adoption in control and decision-making tasks due to its strong performance in dynamic environments. However, DRL agents are vulnerable to noisy observations and adversarial attacks, and concerns about the adversarial robustness of DRL systems have emerged. Recent efforts have focused on addressing these robustness issues by establishing rigorous theoretical guarantees for the returns achieved by DRL agents in adversarial settings. Among these approaches, policy smoothing has proven to be an effective and scalable method for certifying the robustness of DRL agents. Nevertheless, existing certifiably robust DRL relies on policies trained with simple Gaussian augmentations, resulting in a suboptimal trade-off between certified robustness and certified return. To address this issue, we introduce a novel paradigm dubbed Certified-rAdius-Maximizing Policy (CAMP) training. CAMP is designed to enhance DRL policies, achieving better utility without compromising provable robustness. By leveraging the insight that the global certified radius can be derived from local certified radii based on training-time statistics, CAMP formulates a surrogate loss related to the local certified radius and optimizes the policy guided by this surrogate loss. We also introduce policy imitation as a novel technique to stabilize CAMP training. Experimental results demonstrate that CAMP significantly improves the robustness-return trade-off across various tasks. Based on the results, CAMP can achieve up to twice the certified expected return compared to that of baselines. Our code is available at https://github.com/NeuralSec/camp-robust-rl.
中文: 提出的CAMP训练范式通过基于局部认证半径的替代损失优化深度强化学习策略,在保持可证明鲁棒性的同时,将认证期望回报提升至基线方法的两倍。
English: The proposed CAMP training paradigm enhances deep reinforcement learning policies by optimizing a surrogate loss based on local certified radii, achieving superior certified robustness and up to double the certified expected return compared to baseline methods.

Authors:Wonbin Kweon, Sanghwan Jang, SeongKu Kang, Hwanjo Yu
Title: Uncertainty Quantification and Decomposition for LLM-based Recommendation
Abstract:
Despite the widespread adoption of large language models (LLMs) for recommendation, we demonstrate that LLMs often exhibit uncertainty in their recommendations. To ensure the trustworthy use of LLMs in generating recommendations, we emphasize the importance of assessing the reliability of recommendations generated by LLMs. We start by introducing a novel framework for estimating the predictive uncertainty to quantitatively measure the reliability of LLM-based recommendations. We further propose to decompose the predictive uncertainty into recommendation uncertainty and prompt uncertainty, enabling in-depth analyses of the primary source of uncertainty. Through extensive experiments, we (1) demonstrate predictive uncertainty effectively indicates the reliability of LLM-based recommendations, (2) investigate the origins of uncertainty with decomposed uncertainty measures, and (3) propose uncertainty-aware prompting for a lower predictive uncertainty and enhanced recommendation. Our source code and model weights are available at https://github.com/WonbinKweon/UNC_LLM_REC_WWW2025
中文: 大语言模型在推荐中常表现出不确定性,为此我们提出了一个评估框架,通过分解预测不确定性来量化可靠性,实验证明该方法能有效指导优化并提升推荐质量。
English: Large language models often show uncertainty in recommendations, so we developed a framework to measure and decompose this uncertainty, proving it effectively indicates reliability and can enhance recommendations through uncertainty-aware prompting.

Authors:Xie Zhang, Chenxiao Li, Chenshu Wu
Title: TAPOR: 3D Hand Pose Reconstruction with Fully Passive Thermal Sensing for Around-Device Interactions
Abstract:
This paper presents the design and implementation of TAPOR, a privacy-preserving, non-contact, and fully passive sensing system for accurate and robust 3D hand pose reconstruction for around-device interaction using a single low-cost thermal array sensor. Thermal sensing using inexpensive and miniature thermal arrays emerges with an excellent utility-privacy balance, offering an imaging resolution significantly lower than cameras but far superior to RF signals like radar or WiFi. The design of TAPOR, however, is challenging, mainly because the captured temperature maps are low-resolution and textureless. To overcome the challenges, we investigate thermo-depth and thermo-pose properties, proposing a novel physics-inspired neural network that learns effective 3D spatial representations of potential hand poses. We then formulate the 3D pose reconstruction problem as a distinct retrieval task, enabling accurate hand pose determination from the input temperature map. To deploy TAPOR on IoT devices, we introduce an effective heterogeneous knowledge distillation method, reducing computation by 377x. TAPOR is fully implemented and tested in real-world scenarios, showing remarkable performance, supported by four gesture control and finger tracking case studies. We envision TAPOR to be a ubiquitous interface for around-device control and have open-sourced it at https://github.com/aiot-lab/TAPOR.
中文: 本文介绍了TAPOR系统,这是一种基于低成本热阵列传感器的隐私保护、非接触式全被动感知系统,通过创新的物理启发神经网络实现精确的3D手部姿态重建,并利用异构知识蒸馏方法在物联网设备上高效部署,支持设备周边的交互控制。
English: This paper introduces TAPOR, a privacy-focused, non-contact, and fully passive system that uses a single low-cost thermal array sensor for accurate 3D hand pose reconstruction, enabling around-device interaction through a novel physics-inspired neural network and efficient deployment on IoT devices.

Authors:Daesoo Lee, Sara Malacarne, Erlend Aune
Title: Closing the Gap Between Synthetic and Ground Truth Time Series Distributions via Neural Mapping
Abstract:
In this paper, we introduce Neural Mapper for Vector Quantized Time Series Generator (NM-VQTSG), a novel method aimed at addressing fidelity challenges in vector quantized (VQ) time series generation. VQ-based methods, such as TimeVQVAE, have demonstrated success in generating time series but are hindered by two critical bottlenecks: information loss during compression into discrete latent spaces and deviations in the learned prior distribution from the ground truth distribution. These challenges result in synthetic time series with compromised fidelity and distributional accuracy. To overcome these limitations, NM-VQTSG leverages a U-Net-based neural mapping model to bridge the distributional gap between synthetic and ground truth time series. More specifically, the model refines synthetic data by addressing artifacts introduced during generation, effectively aligning the distributions of synthetic and real data. Importantly, NM-VQTSG can be used for synthetic time series generated by any VQ-based generative method. We evaluate NM-VQTSG across diverse datasets from the UCR Time Series Classification archive, demonstrating its capability to consistently enhance fidelity in both unconditional and conditional generation tasks. The gains are evidenced by significant improvements in FID, IS, and conditional FID, and are additionally backed up by visual inspection in data space and latent space. Our findings establish NM-VQTSG as a new method to improve the quality of synthetic time series. Our implementation is available at https://github.com/ML4ITS/TimeVQVAE.
中文: 本文提出NM-VQTSG方法,通过U-Net神经网络映射模型修正矢量量化时间序列生成中的分布偏差和伪影,在多个数据集上显著提升了生成数据的保真度和分布准确性。
English: This paper presents NM-VQTSG, a U-Net-based neural mapping model that enhances the fidelity of vector quantized time series generation by correcting distributional gaps and artifacts in synthetic data, demonstrating consistent improvements across multiple datasets and metrics.

Authors:Anh-Kiet Duong, Petra Gomez-Krämer
Title: Action Recognition Using Temporal Shift Module and Ensemble Learning
Abstract:
This paper presents the first-rank solution for the Multi-Modal Action Recognition Challenge, part of the Multi-Modal Visual Pattern Recognition Workshop at the \acl{ICPR} 2024. The competition aimed to recognize human actions using a diverse dataset of 20 action classes, collected from multi-modal sources. The proposed approach is built upon the \acl{TSM}, a technique aimed at efficiently capturing temporal dynamics in video data, incorporating multiple data input types. Our strategy included transfer learning to leverage pre-trained models, followed by meticulous fine-tuning on the challenge's specific dataset to optimize performance for the 20 action classes. We carefully selected a backbone network to balance computational efficiency and recognition accuracy and further refined the model using an ensemble technique that integrates outputs from different modalities. This ensemble approach proved crucial in boosting the overall performance. Our solution achieved a perfect top-1 accuracy on the test set, demonstrating the effectiveness of the proposed approach in recognizing human actions across 20 classes. Our code is available online https://github.com/ffyyytt/TSM-MMVPR.
中文: 本文提出了多模态动作识别挑战赛的冠军方案,采用基于TSM的集成方法并结合迁移学习,在20个动作类别上实现了100%的识别准确率。
English: This paper introduces the top-performing solution for the Multi-Modal Action Recognition Challenge, utilizing a TSM-based ensemble approach with transfer learning to achieve perfect accuracy across 20 action classes.
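The temporal shift operation that gives TSM its name can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code: the function name `temporal_shift`, the nested-list layout (frames by channels, with spatial dimensions omitted), and the `fold_div=8` default are assumptions made for the example. One group of channels takes its values from the next frame, an equally sized group from the previous frame, and the remaining channels stay in place, with zeros filling positions that have no source frame.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis.

    frames: list of T frames, each a list of C channel values
            (spatial dimensions are omitted to keep the sketch small).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div  # channels moved in each direction

    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                # First group: take the value from the next frame.
                src = t + 1
            elif c < 2 * fold:
                # Second group: take the value from the previous frame.
                src = t - 1
            else:
                # Remaining channels stay in place.
                src = t
            if 0 <= src < num_frames:
                out[t][c] = frames[src][c]
    return out


# 3 frames, 8 channels; frame t holds the value 10*t + c in channel c.
clip = [[float(10 * t + c) for c in range(8)] for t in range(3)]
shifted = temporal_shift(clip)
```

Because the shift only moves existing values, it mixes information across neighbouring frames without adding parameters, which is what makes it attractive inside a 2D backbone.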

Authors:Gaole He, Nilay Aishwarya, Ujwal Gadiraju
Title: Is Conversational XAI All You Need? Human-AI Decision Making With a Conversational XAI Assistant
Abstract:
Explainable artificial intelligence (XAI) methods are being proposed to help interpret and understand how AI systems reach specific predictions. Inspired by prior work on conversational user interfaces, we argue that augmenting existing XAI methods with conversational user interfaces can increase user engagement and boost user understanding of the AI system. In this paper, we explored the impact of a conversational XAI interface on users' understanding of the AI system, their trust, and reliance on the AI system. In comparison to an XAI dashboard, we found that the conversational XAI interface can bring about a better understanding of the AI system among users and higher user trust. However, users of both the XAI dashboard and conversational XAI interfaces showed clear overreliance on the AI system. Enhanced conversations powered by large language model (LLM) agents amplified over-reliance. Based on our findings, we reason that the potential cause of such overreliance is the illusion of explanatory depth that is concomitant with both XAI interfaces. Our findings have important implications for designing effective conversational XAI interfaces to facilitate appropriate reliance and improve human-AI collaboration. Code can be found at https://github.com/delftcrowd/IUI2025_ConvXAI
中文: 研究表明,相比仪表板,对话式可解释人工智能界面能提升用户对AI系统的理解和信任,但两种界面均导致用户过度依赖AI,且大语言模型增强的对话会加剧此现象,其根源在于解释深度的错觉。
English: This study demonstrates that conversational XAI interfaces improve user understanding and trust in AI systems compared to dashboards, but both interfaces lead to overreliance, which is amplified by LLM-enhanced conversations due to the illusion of explanatory depth.

Authors:Matt C. Bendel, Saurav K. Shastri, Rizwan Ahmad, Philip Schniter
Title: Solving Inverse Problems using Diffusion with Iterative Colored Renoising
Abstract:
Imaging inverse problems can be solved in an unsupervised manner using pre-trained diffusion models, but doing so requires approximating the gradient of the measurement-conditional score function in the diffusion reverse process. We show that the approximations produced by existing methods are relatively poor, especially early in the reverse process, and so we propose a new approach that iteratively reestimates and "renoises" the estimate several times per diffusion step. This iterative approach, which we call Fast Iterative REnoising (FIRE), injects colored noise that is shaped to ensure that the pre-trained diffusion model always sees white noise, in accordance with how it was trained. We then embed FIRE into the DDIM reverse process and show that the resulting "DDfire" offers state-of-the-art accuracy and runtime on several linear inverse problems, as well as phase retrieval. Our implementation is at https://github.com/matt-bendel/DDfire
中文摘要:提出的FIRE方法通过在扩散过程中迭代重估和重噪估计,改进了用于解决成像逆问题的扩散模型,实现了最先进的精度和运行效率。
English Summary: The proposed FIRE method enhances diffusion models for solving imaging inverse problems by iteratively reestimating and renoising estimates within the diffusion process, achieving state-of-the-art accuracy and efficiency.
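The idea of shaping injected noise so that the denoiser sees white noise can be illustrated in the simplest possible case, where the error in the current estimate is assumed independent across coordinates with known variances. The sketch below is a toy under those assumptions, not the FIRE algorithm itself: `colored_noise_variances` and `renoise` are hypothetical names, and the diagonal error model is a simplification chosen for the example. Each coordinate receives just enough extra noise that the total error variance equals the target `sigma2` everywhere.

```python
import random


def colored_noise_variances(error_var, sigma2):
    """Per-coordinate variance of the noise to inject.

    error_var: assumed variance of the existing estimation error per coordinate.
    sigma2:    target total (white) error variance; must be >= max(error_var).
    """
    if max(error_var) > sigma2:
        raise ValueError("target variance must cover the largest error variance")
    return [sigma2 - v for v in error_var]


def renoise(estimate, error_var, sigma2, rng):
    """Add independent Gaussian noise shaped to whiten the total error."""
    added = colored_noise_variances(error_var, sigma2)
    return [x + rng.gauss(0.0, a ** 0.5) for x, a in zip(estimate, added)]


error_var = [0.25, 0.5, 1.0]   # hypothetical error variances of an estimate
sigma2 = 1.0                   # noise level the denoiser is assumed to expect
added = colored_noise_variances(error_var, sigma2)
total = [v + a for v, a in zip(error_var, added)]
noisy = renoise([0.0, 0.0, 0.0], error_var, sigma2, random.Random(0))
```

Coordinates that are already as noisy as the target receive no extra noise, while well-estimated coordinates receive the most.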

Authors:Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Title: Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Abstract:
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail, with a leakage ratio of up to 100%, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that it is reckless to consider guardrail moderation as a clutch at straws towards harmful fine-tuning attack, as it cannot solve the inherent safety issue of the pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
Chinese: 本研究表明,仅依赖护栏审核过滤有害数据不可靠,因为提出的Virus攻击方法能通过微调有害样本轻松绕过防护,暴露了预训练大语言模型固有的安全隐患。
English: This study reveals that relying solely on guardrail moderation to filter harmful data is unreliable, as the proposed Virus attack method can bypass it by subtly modifying harmful samples, exposing the inherent safety vulnerabilities of pre-trained large language models.

Authors:Sait Sovukluk, Christian Ott
Title: An Efficient Numerical Function Optimization Framework for Constrained Nonlinear Robotic Problems
Abstract:
This paper presents a numerical function optimization framework designed for constrained optimization problems in robotics. The tool is designed with real-time considerations and is suitable for online trajectory and control input optimization problems. The proposed framework does not require any analytical representation of the problem and works with constrained black-box optimization functions. The method combines first-order gradient-based line search algorithms with constraint prioritization through nullspace projections onto constraint Jacobian space. The tool is implemented in C++ and provided online for community use, along with some numerical and robotic example implementations presented in this paper.
中文: 本文提出了一种用于机器人学的实时数值优化框架,通过零空间投影结合梯度算法处理约束黑箱优化问题,并提供了C++开源实现及机器人应用案例。
English: This paper introduces a real-time numerical optimization framework for robotics that handles constrained black-box problems using gradient-based algorithms and constraint prioritization via nullspace projections, with a C++ implementation provided online.
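Combining a gradient step with a nullspace projection can be sketched for a single linear constraint. This is an illustrative Python toy under stated assumptions, not the authors' C++ tool: `project_to_nullspace` is a hypothetical helper, and only one constraint row `jac` is handled, for which the projector I - J^T J / (J J^T) reduces to subtracting the component of the gradient along `jac`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_nullspace(grad, jac):
    """Remove from grad its component along the constraint Jacobian row jac."""
    scale = dot(jac, grad) / dot(jac, jac)
    return [g - scale * j for g, j in zip(grad, jac)]


# Objective f(x) = x0^2 + x1^2 + x2^2; constraint x0 + x1 + x2 = const.
x = [3.0, 0.0, 0.0]
grad = [2.0 * xi for xi in x]      # gradient of f at x
jac = [1.0, 1.0, 1.0]              # Jacobian of the constraint

step = project_to_nullspace(grad, jac)
x_new = [xi - 0.5 * si for xi, si in zip(x, step)]
```

The projected step lowers the objective while leaving the constraint value unchanged (exactly here, because the constraint is linear; only to first order for nonlinear constraints).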

Authors:Chongyu Qu, Ritchie Zhao, Ye Yu, Bin Liu, Tianyuan Yao, Junchao Zhu, Bennett A. Landman, Yucheng Tang, Yuankai Huo
Title: Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines
Abstract:
Quantizing deep neural networks, reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods studied "fake quantization", which simulates lower precision operations during inference, but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via a TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net, TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. https://github.com/hrlblab/PTQ.
Chinese: 本研究提出了一种真实的后训练量化框架,成功在先进的3D医学分割模型上实现真正的8位量化,在GPU上显著减小模型尺寸并降低推理延迟,同时保持模型性能。
English: This study introduces a real post-training quantization framework that successfully implements true 8-bit quantization on state-of-the-art 3D medical segmentation models, significantly reducing model size and inference latency without compromising performance on GPUs.
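The distinction between fake and real quantization can be made concrete with a small example. The following is a minimal sketch of symmetric per-tensor 8-bit quantization in plain Python rather than TensorRT; the function names and the max-absolute-value calibration rule are assumptions chosen for illustration and are not taken from the released code. Real quantization stores integer codes, whereas fake quantization round-trips through those codes and hands floats back to the network.

```python
def calibrate_scale(values):
    """Symmetric per-tensor scale from the largest absolute calibration value."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Real quantization: map floats to integers in [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def fake_quantize(values, scale):
    """Fake quantization: round-trip through int8 codes but return floats."""
    return dequantize(quantize(values, scale), scale)


weights = [-1.27, -0.5, 0.0, 0.3, 1.27]   # hypothetical weight values
scale = calibrate_scale(weights)
codes = quantize(weights, scale)
approx = fake_quantize(weights, scale)
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

Only `codes` would shrink storage; `approx` has the same numerical error but still occupies full-precision floats, which is why fake quantization alone yields no size or latency benefit.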

Authors:Hossein Mirzaei, Mojtaba Nafez, Moein Madadi, Arad Maleki, Mahdi Hajialilue, Zeinab Sadat Taghavi, Sepehr Rezaee, Ali Ansari, Bahar Dibaei Nia, Kian Shamsaie, Mohammadreza Salehi, Mackenzie W. Mathis, Mahdieh Soleymani Baghshah, Mohammad Sabokrou, Mohammad Hossein Rohban
Title: A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts
Abstract:
There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.
中文摘要:该研究提出了一种新颖性检测方法,通过构建风格特征相似但核心特征不同的辅助分布外数据集,并利用知识蒸馏技术区分风格与核心特征,从而提升模型在风格变化下的检测鲁棒性。
English Summary: The proposed robust novelty detection method creates an auxiliary out-of-distribution set with matching style features but different core features, then uses knowledge distillation to help the model distinguish between style and core features for improved detection performance.

Authors:David Salinas, Omar Swelam, Frank Hutter
Title: Tuning LLM Judge Design Decisions for 1/1000 of the Cost
Abstract:
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective multi-fidelity optimization, which makes it possible to find judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at this repository: https://github.com/geoalgo/judgetuning.
Chinese: 本文提出一种系统性优化方法,通过多目标多保真度优化调整超参数,使基于大语言模型的评估器在采用开源权重模型时,实现了更高的准确率、成本效益及可复现性。
English: This paper introduces a systematic method to optimize LLM-based judges by tuning hyperparameters using multi-objective multi-fidelity optimization, achieving higher accuracy and cost-efficiency with open-weight models for better accessibility.
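Trading accuracy for cost amounts to keeping only the judge configurations that no other configuration beats on both objectives. The sketch below shows that Pareto filtering step in plain Python; the function name `pareto_front`, the configuration names, and all numbers are made up for illustration, and the multi-fidelity part of the search is not covered.

```python
def pareto_front(configs):
    """Return configs not dominated in (higher accuracy, lower cost).

    configs: list of (name, accuracy, cost) tuples.
    """
    front = []
    for name, acc, cost in configs:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in configs
        )
        if not dominated:
            front.append((name, acc, cost))
    return front


# Hypothetical judge configurations: (name, agreement with humans, cost per eval).
judges = [
    ("small-model/short-prompt", 0.70, 1.0),
    ("small-model/long-prompt", 0.74, 2.0),
    ("large-model/short-prompt", 0.73, 5.0),
    ("large-model/long-prompt", 0.80, 8.0),
]
front = pareto_front(judges)
front_names = [name for name, _, _ in front]
```

In this made-up example the large model with a short prompt is dropped, because the small model with a long prompt is both more accurate and cheaper.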

Authors:Zhihong Wu, Lishuang Wang, Kebin Sun, Zhuozhao Li, Ran Cheng
Title: Enabling Population-Level Parallelism in Tree-Based Genetic Programming for Comprehensive GPU Acceleration
Abstract:
Tree-based Genetic Programming (TGP) is a widely used evolutionary algorithm for tasks such as symbolic regression, classification, and robotic control. Due to the intensive computational demands of running TGP, GPU acceleration is crucial for achieving scalable performance. However, efficient GPU-based execution of TGP still remains challenging, primarily due to three core issues: (1) the structural heterogeneity of program individuals, (2) the complexity of integrating multiple levels of parallelism, and (3) the incompatibility between high-performance CUDA execution and flexible Python-based environments. To address these issues, we propose EvoGP, a high-performance framework tailored for comprehensive GPU acceleration of TGP via population-level parallel execution. First, EvoGP introduces a tensorized representation that encodes variable-sized trees into fixed-shape, memory-aligned arrays, enabling uniform memory access and parallel computation across diverse individuals. Second, EvoGP adopts an adaptive parallelism strategy that dynamically combines intra- and inter-individual parallelism based on dataset size, ensuring high GPU utilization across a broad spectrum of tasks. Third, EvoGP embeds custom CUDA kernels into the PyTorch runtime, achieving seamless integration with Python-based environments such as Gym, MuJoCo, Brax, and Genesis. Comprehensive experiments show that EvoGP achieves up to 140x speedup over state-of-the-art GPU-based TGP implementations, while maintaining competitive accuracy and significantly improving scalability under large population sizes. EvoGP is open source and accessible at: https://github.com/EMI-Group/evogp.
中文: EvoGP是一个高性能框架,通过张量化表示、自适应并行策略和CUDA-PyTorch集成,实现了树基遗传编程的全面GPU加速,在保持精度的同时获得了高达140倍的加速比并显著提升了可扩展性。
English: EvoGP is a high-performance framework that enables comprehensive GPU acceleration for Tree-based Genetic Programming by using tensorized representations, adaptive parallelism, and CUDA-PyTorch integration, achieving up to 140x speedup while maintaining accuracy and scalability.
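Encoding variable-sized trees into fixed-shape arrays can be sketched as follows. This is a simplified pure-Python illustration, not EvoGP's actual memory layout or CUDA kernels: the node-type codes, `MAX_LEN`, and the `encode` and `evaluate` helpers are assumptions made for the example. Each tree is flattened in prefix order and padded to a common length, so that a whole population shares one array shape.

```python
MAX_LEN = 8
PAD, CONST, VAR, ADD, MUL = 0, 1, 2, 3, 4


def encode(tree):
    """Flatten a nested-tuple tree into fixed-length (types, values) arrays.

    Trees look like ("add", left, right), ("mul", left, right),
    ("var", index) or ("const", value); nodes are stored in prefix order.
    """
    types, values = [], []

    def visit(node):
        kind = node[0]
        if kind == "const":
            types.append(CONST)
            values.append(float(node[1]))
        elif kind == "var":
            types.append(VAR)
            values.append(float(node[1]))
        else:
            types.append(ADD if kind == "add" else MUL)
            values.append(0.0)
            visit(node[1])
            visit(node[2])

    visit(tree)
    if len(types) > MAX_LEN:
        raise ValueError("tree exceeds MAX_LEN nodes")
    pad = MAX_LEN - len(types)
    return types + [PAD] * pad, values + [0.0] * pad


def evaluate(types, values, inputs):
    """Evaluate a prefix-encoded tree by scanning right to left with a stack."""
    stack = []
    for t, v in zip(reversed(types), reversed(values)):
        if t == PAD:
            continue
        if t == CONST:
            stack.append(v)
        elif t == VAR:
            stack.append(inputs[int(v)])
        else:
            left, right = stack.pop(), stack.pop()
            stack.append(left + right if t == ADD else left * right)
    return stack[0]


trees = [
    ("add", ("var", 0), ("const", 2.0)),                 # x0 + 2
    ("add", ("mul", ("var", 0), ("var", 1)), ("var", 0)),  # x0 * x1 + x0
]
population = [encode(t) for t in trees]
outputs = [evaluate(ts, vs, [3.0, 4.0]) for ts, vs in population]
```

Because every individual occupies the same number of slots, the loop over `population` is the part that a GPU implementation could run in parallel across individuals.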

Authors:Hossein Mirzaei, Ali Ansari, Bahar Dibaei Nia, Mojtaba Nafez, Moein Madadi, Sepehr Rezaee, Zeinab Sadat Taghavi, Arad Maleki, Kian Shamsaie, Mahdi Hajialilue, Jafar Habibi, Mohammad Sabokrou, Mohammad Hossein Rohban
Title: Scanning Trojaned Models Using Out-of-Distribution Samples
Abstract:
Scanning for trojans (backdoors) in deep neural networks is crucial due to their significant real-world applications. There has been an increasing focus on developing effective general trojan scanning methods across various trojan attacks. Despite advancements, there remains a shortage of methods that perform effectively without preconceived assumptions about the backdoor attack method. Additionally, we have observed that current methods struggle to identify classifiers trojaned using adversarial training. Motivated by these challenges, our study introduces a novel scanning method named TRODO (TROjan scanning by Detection of adversarial shifts in Out-of-distribution samples). TRODO leverages the concept of "blind spots"--regions where trojaned classifiers erroneously identify out-of-distribution (OOD) samples as in-distribution (ID). We scan for these blind spots by adversarially shifting OOD samples towards in-distribution. The increased likelihood of perturbed OOD samples being classified as ID serves as a signature for trojan detection. TRODO is both trojan and label mapping agnostic, and is effective even against adversarially trained trojaned classifiers. It is applicable even in scenarios where training data is absent, demonstrating high accuracy and adaptability across various scenarios and datasets, highlighting its potential as a robust trojan scanning strategy.
中文摘要:本研究提出了一种名为TRODO的新型木马扫描方法,通过检测分布外样本中的对抗性偏移来识别深度神经网络中的后门,无需训练数据即可有效应对各类攻击场景。
English Summary: The study introduces TRODO, a novel trojan scanning method that detects backdoors in deep neural networks by identifying adversarial shifts in out-of-distribution samples, proving effective across various attacks and scenarios without requiring training data.

Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Title: Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
Abstract:
Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
Chinese Summary: 本文提出Mamba-Shedder方法,通过压缩选择性结构化状态空间模型来减少计算开销和模型规模,同时保持精度,实现了高达1.4倍的推理加速。
English Summary: This paper introduces Mamba-Shedder, a method for compressing selective structured state space models to reduce computational overhead and model size while preserving accuracy, achieving up to 1.4x inference speedup.

Authors:Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Ali Ansari, Sepehr Ghobadi, Masoud Hadi, Arshia Soltani Moakhar, Mohammad Azizmalayeri, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Title: RODEO: Robust Outlier Detection via Exposing Adaptive Out-of-Distribution Samples
Abstract:
In recent years, there have been significant improvements in various forms of image outlier detection. However, outlier detection performance under adversarial settings lags far behind that in standard settings. This is due to the lack of effective exposure to adversarial scenarios during training, especially on unseen outliers, leading to detection models failing to learn robust features. To bridge this gap, we introduce RODEO, a data-centric approach that generates effective outliers for robust outlier detection. More specifically, we show that incorporating outlier exposure (OE) and adversarial training can be an effective strategy for this purpose, as long as the exposed training outliers meet certain characteristics, including diversity, and both conceptual differentiability and analogy to the inlier samples. We leverage a text-to-image model to achieve this goal. We demonstrate both quantitatively and qualitatively that our adaptive OE method effectively generates "diverse" and "near-distribution" outliers, leveraging information from both text and image domains. Moreover, our experimental results show that utilizing our synthesized outliers significantly enhances the performance of the outlier detector, particularly in adversarial settings.
中文: 近年来图像异常检测在对抗性场景下表现不佳,为此我们提出RODEO方法,通过文本到图像模型生成多样且接近正常分布的异常样本,有效提升了检测器在对抗环境下的性能。
English: Recent advances in image outlier detection struggle with adversarial scenarios due to insufficient training exposure, prompting the introduction of RODEO, a data-centric method using text-to-image models to generate diverse and near-distribution outliers that significantly boost detector robustness.

Authors:Nikolaos Kaparinos, Vasileios Mezaris
Title: B-FPGM: Lightweight Face Detection via Bayesian-Optimized Soft FPGM Pruning
Abstract:
Face detection is a computer vision application that increasingly demands lightweight models to facilitate deployment on devices with limited computational resources. Neural network pruning is a promising technique that can effectively reduce network size without significantly affecting performance. In this work, we propose a novel face detection pruning pipeline that leverages Filter Pruning via Geometric Median (FPGM) pruning, Soft Filter Pruning (SFP) and Bayesian optimization in order to achieve a superior trade-off between size and performance compared to existing approaches. FPGM pruning is a structured pruning technique that allows pruning the least significant filters in each layer, while SFP iteratively prunes the filters and allows them to be updated in any subsequent training step. Bayesian optimization is employed in order to optimize the pruning rates of each layer, rather than relying on engineering expertise to determine the optimal pruning rates for each layer. In our experiments across all three subsets of the WIDER FACE dataset, our proposed approach B-FPGM consistently outperforms existing ones in balancing model size and performance. All our experiments were applied to EResFD, the currently smallest (in number of parameters) well-performing face detector of the literature; a small ablation study with a second small face detector, EXTD, is also reported. The source code and trained pruned face detection models can be found at: https://github.com/IDTITI/B-FPGM.
Chinese: 本研究提出了一种结合FPGM、SFP和贝叶斯优化的新型人脸检测剪枝流程,在WIDER FACE数据集上相比现有方法,能更好地平衡模型大小与性能。
English: This study introduces a novel face detection pruning pipeline combining FPGM, SFP, and Bayesian optimization to achieve an optimal balance between model size and performance, consistently outperforming existing methods on the WIDER FACE dataset.
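The filter selection behind FPGM combined with soft pruning can be sketched as below. This is a minimal pure-Python illustration under stated assumptions, not the released B-FPGM code: `fpgm_select` and `soft_prune` are hypothetical names, the geometric-median criterion is approximated by picking the filters with the smallest total distance to all other filters in the layer, and the per-layer `prune_rate` is fixed by hand here rather than tuned by Bayesian optimization.

```python
import math


def fpgm_select(filters, prune_rate):
    """Indices of filters with the smallest total distance to all other filters.

    filters: list of flattened filter weight vectors from one layer.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    totals = [sum(dist(f, g) for g in filters) for f in filters]
    num_pruned = int(len(filters) * prune_rate)
    order = sorted(range(len(filters)), key=lambda i: totals[i])
    return sorted(order[:num_pruned])


def soft_prune(filters, prune_rate):
    """Zero the selected filters; they stay in the layer and may be retrained."""
    selected = set(fpgm_select(filters, prune_rate))
    return [[0.0] * len(f) if i in selected else list(f)
            for i, f in enumerate(filters)]


layer = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],   # lies between the others, so it is the most replaceable
    [-1.0, 0.0],
]
pruned_idx = fpgm_select(layer, prune_rate=0.25)
pruned_layer = soft_prune(layer, prune_rate=0.25)
```

Keeping the zeroed filters in place, instead of deleting them, is what lets a later training step revive a filter that turns out to matter.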

Authors:Shady Nasrat, Myungsu Kim, Seonil Lee, Jiho Lee, Yeoncheol Jang, Seung-joon Yi
Title: RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains
Abstract:
Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. We showcase the capabilities of our framework within the context of the real-world household competition. This research introduces a framework that utilizes RDMM (Robotics Decision-Making Models), which possess the capacity for decision-making within domain-specific contexts, as well as an awareness of their personal knowledge and capabilities. The framework leverages information to enhance the autonomous decision-making of the system. In contrast to other approaches, our focus is on real-time, on-device solutions, successfully operating on hardware with as little as 8GB of memory. Our framework incorporates visual perception models, equipping robots with an understanding of their environment. Additionally, the framework has integrated real-time speech recognition capabilities, thus enhancing the human-robot interaction experience. Experimental results demonstrate that the RDMM framework can plan with 93% accuracy. Furthermore, we introduce a new dataset consisting of 27k planning instances, as well as 1.3k text-image annotated samples derived from the competition. The framework, benchmarks, datasets, and models developed in this work are publicly available on our GitHub repository at https://github.com/shadynasrat/RDMM.
中文: 本研究提出一种实时、设备端的RDMM框架,通过整合视觉感知与语音识别技术,在家庭任务中实现93%规划准确率的机器人自主决策,并配套发布了新的公开数据集。
English: This research introduces a real-time, on-device RDMM framework that enhances robotic decision-making with 93% planning accuracy by integrating visual perception and speech recognition for household tasks, supported by a new public dataset.

Authors:Arik Reuter, Tim G. J. Rudner, Vincent Fortuin, David Rügamer
Title: Can Transformers Learn Full Bayesian Inference in Context?
Abstract:
Transformers have emerged as the dominant architecture in the field of deep learning, with a broad range of applications and remarkable in-context learning (ICL) capabilities. While not yet fully understood, ICL has already proved to be an intriguing phenomenon, allowing transformers to learn in context -- without requiring further training. In this paper, we further advance the understanding of ICL by demonstrating that transformers can perform full Bayesian inference for commonly used statistical models in context. More specifically, we introduce a general framework that builds on ideas from prior fitted networks and continuous normalizing flows and enables us to infer complex posterior distributions for models such as generalized linear models and latent factor models. Extensive experiments on real-world datasets demonstrate that our ICL approach yields posterior samples that are similar in quality to state-of-the-art MCMC or variational inference methods that do not operate in context. The source code for this paper is available at https://github.com/ArikReuter/ICL_for_Full_Bayesian_Inference.
中文: 本文证明Transformer能够通过上下文学习对统计模型执行完整的贝叶斯推断,无需额外训练即可生成与传统方法质量相当的后验样本。
English: This paper demonstrates that transformers can perform full Bayesian inference for statistical models through in-context learning, producing posterior samples comparable to traditional methods without requiring additional training.

Authors:Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun
Title: RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception
Abstract:
Cooperative perception enhances autonomous driving by leveraging Vehicle-to-Everything (V2X) communication for multi-agent sensor fusion. However, most existing methods rely on single-modal data sharing, limiting fusion performance, particularly in heterogeneous sensor settings involving both LiDAR and cameras across vehicles and roadside units (RSUs). To address this, we propose Radian Glue Attention (RG-Attn), a lightweight and generalizable cross-modal fusion module that unifies intra-agent and inter-agent fusion via transformation-based coordinate alignment and a unified sampling/inversion strategy. RG-Attn efficiently aligns features through a radian-based attention constraint, operating column-wise on geometrically consistent regions to reduce overhead and preserve spatial coherence, thereby enabling accurate and robust fusion. Building upon RG-Attn, we propose three cooperative architectures. The first, Paint-To-Puzzle (PTP), prioritizes communication efficiency but assumes all agents have LiDAR, optionally paired with cameras. The second, Co-Sketching-Co-Coloring (CoS-CoCo), offers maximal flexibility, supporting any sensor setup (e.g., LiDAR-only, camera-only, or both) and enabling strong cross-modal generalization for real-world deployment. The third, Pyramid-RG-Attn Fusion (PRGAF), aims for peak detection accuracy with the highest computational overhead. Extensive evaluations on simulated and real-world datasets show our framework delivers state-of-the-art detection accuracy with high flexibility and efficiency. GitHub Link: https://github.com/LantaoLi/RG-Attn
中文摘要:提出的Radian Glue Attention(RG-Attn)模块通过坐标对齐和统一采样策略实现高效跨模态融合,其三种架构分别在通信效率、传感器兼容性和检测精度方面提供不同优势。
English Summary: The proposed Radian Glue Attention (RG-Attn) module enables efficient cross-modal fusion for cooperative perception through coordinate alignment and unified sampling, with three architectures offering varying balances of communication efficiency, sensor flexibility, and detection accuracy.

Authors:Yinfeng Gao, Qichao Zhang, Da-wei Ding, Dongbin Zhao
Title: Dream to Drive with Predictive Individual World Model
Abstract:
Generating reactive driving behaviors in complex urban environments remains challenging because road users' intentions are unknown. Model-based reinforcement learning (MBRL) offers great potential to learn a reactive policy by constructing a world model that can provide informative states and imagination training. However, a critical limitation in relevant research lies in the scene-level reconstruction representation learning, which may overlook key interactive vehicles and can hardly model the interactive features among vehicles and their long-term intentions. Therefore, this paper presents a novel MBRL method with a predictive individual world model (PIWM) for autonomous driving. PIWM describes the driving environment from an individual-level perspective and captures vehicles' interactive relations and their intentions via a trajectory prediction task. Meanwhile, a behavior policy is learned jointly with PIWM. It is trained in PIWM's imagination and effectively navigates urban driving scenes by leveraging intention-aware latent states. The proposed method is trained and evaluated on simulation environments built upon challenging real-world interactive scenarios. Compared with popular model-free and state-of-the-art model-based reinforcement learning methods, experimental results show that the proposed method achieves the best performance in terms of safety and efficiency.
中文摘要:本文提出了一种新颖的基于模型强化学习的方法,通过预测个体世界模型(PIWM)从个体层面捕捉车辆交互关系和意图,在复杂城市驾驶场景中实现了最佳的安全性和效率表现。
English Summary: This paper introduces a novel model-based reinforcement learning method with a Predictive Individual World Model (PIWM) that enhances autonomous driving by capturing vehicle interactions and intentions through trajectory prediction, achieving superior safety and efficiency in urban environments.

Authors:Sunbowen Lee, Shiwen Ni, Chi Wei, Shuaimin Li, Liyang Fan, Ahmadreza Argha, Hamid Alinejad-Rokny, Ruifeng Xu, Yicheng Gong, Min Yang
Title: xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
Abstract:
Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness. Furthermore, we introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success. Experimental results show the superiority of our approach, achieving state-of-the-art (SOTA) performance on several prominent open and closed-source LLMs, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806. Our method sets a new benchmark in jailbreak attack effectiveness, highlighting potential vulnerabilities in LLMs. The codebase for this work is available at https://github.com/Aegis1863/xJailbreak.
中文摘要:本研究提出一种基于强化学习的黑盒越狱方法,通过分析良性提示与恶意提示的嵌入邻近性来优化提示生成,在多个大语言模型上实现最优性能,并建立了全面的越狱评估框架。
English Summary: The proposed reinforcement learning-based black-box jailbreak method enhances attack effectiveness by optimizing prompts through embedding proximity analysis, achieving state-of-the-art performance on multiple LLMs while introducing a comprehensive evaluation framework.

Authors:Jianing Li, Ming Lu, Hao Wang, Chenyang Gu, Wenzhao Zheng, Li Du, Shanghang Zhang
Title: SliceOcc: Indoor 3D Semantic Occupancy Prediction with Vertical Slice Representation
Abstract:
3D semantic occupancy prediction is a crucial task in visual perception, as it requires the simultaneous comprehension of both scene geometry and semantics. It plays a crucial role in understanding 3D scenes and has great potential for various applications, such as robotic vision perception and autonomous driving. Many existing works utilize planar-based representations such as Bird's Eye View (BEV) and Tri-Perspective View (TPV). These representations aim to simplify the complexity of 3D scenes while preserving essential object information, thereby facilitating efficient scene representation. However, in dense indoor environments with prevalent occlusions, directly applying these planar-based methods often leads to difficulties in capturing global semantic occupancy, ultimately degrading model performance. In this paper, we present a new vertical slice representation that divides the scene along the vertical axis and projects spatial point features onto the nearest pair of parallel planes. To utilize these slice features, we propose SliceOcc, an RGB camera-based model specifically tailored for indoor 3D semantic occupancy prediction. SliceOcc utilizes pairs of slice queries and cross-attention mechanisms to extract planar features from input images. These local planar features are then fused to form a global scene representation, which is employed for indoor occupancy prediction. Experimental results on the EmbodiedScan dataset demonstrate that SliceOcc achieves an mIoU of 15.45% across 81 indoor categories, setting a new state-of-the-art performance among RGB camera-based models for indoor 3D semantic occupancy prediction. Code is available at https://github.com/NorthSummer/SliceOcc.
中文: 本文提出SliceOcc模型,通过垂直切片表示和交叉注意力机制,在室内三维语义占据预测任务中实现了最先进的性能,有效解决了平面表示方法在密集遮挡环境中的局限性。
English: This paper introduces SliceOcc, a novel RGB camera-based model that employs vertical slice representation and cross-attention mechanisms to achieve state-of-the-art performance in indoor 3D semantic occupancy prediction, addressing limitations of planar-based methods in dense environments.
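
The mIoU figure quoted in the abstract is a mean of per-class intersection-over-union values. The following is a minimal sketch of how such a metric is commonly computed over flat voxel labels. The function name `mean_iou` and the rule of skipping classes absent from both prediction and ground truth are assumptions for illustration; the EmbodiedScan evaluation code may treat empty or ignored voxels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in pred or target (illustrative sketch)."""
    ious = []
    for c in range(num_classes):
        # Voxels where both prediction and ground truth are class c
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Voxels where either prediction or ground truth is class c
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six voxels
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.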

Authors:Shengyuan Liu, Zhen Chen, Qiushi Yang, Weihao Yu, Di Dong, Jiancong Hu, Yixuan Yuan
Title: Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion
Abstract:
Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms fail to accurately generate the details of polyp boundary regions and typically require medical priors to specify plausible locations and shapes of polyps, which limits the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first fully automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate the state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.
中文: Polyp-Gen是一种全自动基于扩散的框架,通过增强息肉边界细节和利用医学先验知识定位,生成逼真多样的内窥镜图像,以提升自动化诊断系统中的息肉检测能力,并展现出优异的泛化性能。
English: Polyp-Gen is a fully automated diffusion-based framework that generates realistic and diverse endoscopic images for automated diagnostic systems by enhancing polyp boundary details and using medical priors for localization, improving polyp detection and showing strong generalization.

Authors:Aashish Yadavally, Hoan Nguyen, Laurent Callot, Gauthier Guinet
Title: Large Language Model Critics for Execution-Free Evaluation of Code Changes
Abstract:
Large language models (LLMs) offer a promising way forward for automating software engineering tasks, such as bug fixes, feature additions, etc., via multi-step LLM-based agentic workflows. However, existing metrics for evaluating such workflows, mainly build status and occasionally log analysis, are too sparse and limited in providing the information needed to assess the quality of changes made. In this work, we designed LLM-based critics to derive well-structured and rigorous intermediate/step-level, execution-free evaluation proxies for repo-level code changes. Importantly, we assume access to the gold test patch for the problem (i.e., reference-aware) to assess both semantics and executability of generated patches. With the gold test patch as a reference, we predict executability of all editing locations with an F1 score of 91.6%, aggregating which, we can predict the build status in 84.8% of the instances in SWE-bench. In particular, such an execution-focused LLM critic outperforms other reference-free and reference-aware LLM critics by 38.9% to 72.5%. Moreover, we demonstrate the usefulness of such a reference-aware framework in comparing patches generated by different agentic workflows. Finally, we open-source the library developed for this project, which allows further usage for either other agentic workflows or other benchmarks. The source code is available at https://github.com/amazon-science/code-agent-eval.
中文摘要:基于大语言模型的评审机制被设计用于提供严格的分步评估指标,通过参考黄金测试补丁来高精度评估代码语义和可执行性,显著优于其他方法,并能有效比较不同智能工作流程的补丁质量。
English Summary: Large language model-based critics are designed to provide rigorous, step-level evaluation proxies for code changes, using gold test patches to assess semantics and executability with high accuracy, outperforming other methods and enabling effective comparison of agentic workflows.
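
The abstract reports an F1 score over per-location executability predictions, which are then aggregated into a build-status prediction. The sketch below shows one way this could look. The aggregation rule (a build is predicted to pass only if every edited location is predicted executable) and both function names are assumptions for illustration; the abstract does not specify the rule, and this is not the API of the released library.

```python
def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of 0/1 (or bool) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def predict_build_status(location_preds):
    """Assumed rule: the build passes only if all locations are executable."""
    return all(location_preds)


# Per-location executability: ground truth vs. critic predictions (toy data)
truth = [1, 1, 0, 0]
critic = [1, 0, 1, 0]
print(f1_score(truth, critic))
print(predict_build_status([True, True, False]))
```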

Authors:Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng
Title: CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these limitations, we propose Cross-modal Hierarchical Direct Preference Optimization (CHiP). We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the model to capture preferences at multiple granular levels, including response, segment, and token levels. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations. On the Object HalBench dataset, CHiP outperforms DPO in hallucination reduction, achieving relative improvements of 52.7% and 55.5% on the Muffin and LLaVA base models, respectively. We make all our datasets and code publicly available: https://github.com/LVUGAI/CHiP.
中文: 提出的跨模态分层直接偏好优化(CHiP)方法通过整合视觉和分层文本偏好,有效减少多模态模型中的幻觉现象,在多个基准测试中表现优于现有技术。
English: The proposed Cross-modal Hierarchical Direct Preference Optimization (CHiP) method enhances multimodal models by integrating visual and hierarchical textual preferences, significantly reducing hallucinations and outperforming existing techniques on benchmarks.
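
For context, the standard response-level DPO objective that this work builds on can be sketched as follows. This is the generic text-only DPO loss for a single preference pair, not CHiP's visual or hierarchical terms, which the abstract does not spell out. The function name, argument names, and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss equals log(2) when the policy matches the reference, and drops
# as the policy favors the chosen response more than the reference does.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```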

Authors:Ali Safarpoor Dehkordi, Ahad N. Zehmakan
Title: More Efficient Sybil Detection Mechanisms Leveraging Resistance of Users to Attack Requests
Abstract:
We investigate the problem of sybil (fake account) detection in social networks from a graph algorithms perspective, where graph structural information is used to classify users as sybil or benign. We introduce the novel notion of user resistance to attack requests (friendship requests from sybil accounts). Building on this notion, we propose a synthetic graph data generation framework that supports various attack strategies. We then study the optimization problem where we are allowed to reveal the resistance of a subset of users with the aim to maximize the number of users that are discovered to be benign and the number of potential attack edges (connections from a sybil to a benign user). Furthermore, we devise efficient algorithms for this problem and investigate their theoretical guarantees. Finally, through a large set of experiments, we demonstrate that our proposed algorithms improve detection performance notably when applied as a preprocessing step for different sybil detection algorithms. The code and data used in this work are publicly available on GitHub: https://github.com/aSafarpoor/AAMAS2025-Paper/tree/main
中文摘要:本研究提出一种基于图结构的新型Sybil检测方法,通过定义用户对攻击请求的抵抗性,构建合成数据生成框架,并开发高效优化算法,显著提升了多种检测方法的性能表现。
English Summary: This study introduces a novel graph-based approach for sybil detection by defining user resistance to attack requests, developing a synthetic data generation framework, and proposing efficient optimization algorithms that significantly enhance detection performance across various methods.

Authors:Wenfeng Lin, Jiangchuan Wei, Boyuan Liu, Yichen Zhang, Shiyue Yan, Mingyu Guo
Title: CascadeV: An Implementation of Wurstchen Architecture for Video Generation
Abstract:
Recently, with the tremendous success of diffusion models in the field of text-to-image (T2I) generation, increasing attention has been directed toward their potential in text-to-video (T2V) applications. However, the computational demands of diffusion models pose significant challenges, particularly in generating high-resolution videos with high frame rates. In this paper, we propose CascadeV, a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos. Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation. We also implement a spatiotemporal alternating grid 3D attention mechanism, which effectively integrates spatial and temporal information, ensuring superior consistency across the generated video frames. Furthermore, our model can be cascaded with existing T2V models, theoretically enabling a 4$\times$ increase in resolution or frames per second without any fine-tuning. Our code is available at https://github.com/bytedance/CascadeV.
中文: 本文提出CascadeV,一种级联潜在扩散模型,通过采用时空交替网格注意力机制和实现更高压缩比,有效生成高质量2K视频并显著降低计算负担。
English: The paper introduces CascadeV, a cascaded latent diffusion model that efficiently generates high-resolution 2K videos by leveraging a spatiotemporal attention mechanism and achieving higher compression to reduce computational demands.

Authors:Xiaolei Liu, Yan Sun, Zhiliang Wang, Mark Nixon
Title: Unsupervised Domain Adaptation with Dynamic Clustering and Contrastive Refinement for Gait Recognition
Abstract:
Gait recognition is an emerging identification technology that distinguishes individuals at long distances by analyzing individual walking patterns. Traditional techniques rely heavily on large-scale labeled datasets, which incurs high costs and significant labeling challenges. Recently, researchers have explored unsupervised gait recognition with clustering-based unsupervised domain adaptation methods and achieved notable success. However, these methods directly use pseudo-labels generated by clustering and neglect the pseudo-label noise caused by domain differences, which degrades model training. To mitigate these issues, we propose a novel model called GaitDCCR, which aims to reduce the influence of noisy pseudo-labels on clustering and model training. Our approach can be divided into two main stages: clustering and training. In the clustering stage, we propose Dynamic Cluster Parameters (DCP) and Dynamic Weight Centroids (DWC) to improve the efficiency of clustering and obtain reliable cluster centroids. In the training stage, we employ the classical teacher-student structure and propose Confidence-based Pseudo-label Refinement (CPR) and Contrastive Teacher Module (CTM) to encourage noisy samples to converge towards clusters containing their true identities. Extensive experiments on public gait datasets have demonstrated that our simple and effective method significantly enhances the performance of unsupervised gait recognition, laying the foundation for its application in the real world. We will release the code at https://github.com/YanSun-github/GaitDCCR upon acceptance.
中文: 提出的GaitDCCR模型通过动态聚类优化和师生训练框架,有效解决了无监督步态识别中的伪标签噪声问题,在公开数据集上显著提升了识别性能。
English: The proposed GaitDCCR model addresses noisy pseudo-labels in unsupervised gait recognition through dynamic clustering optimization and a teacher-student training framework, significantly improving recognition accuracy on public datasets.

Authors:Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, Jianhua Tao
Title: AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models
Abstract:
The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level, from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Utilizing our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date (by far), featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored for typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results show AffectGPT's robust performance across various MER tasks. We have released both the code and the dataset to advance research and development in emotion understanding: https://github.com/zeroQiaoba/AffectGPT.
中文: 作者提出了AffectGPT模型和新数据集,以解决多模态情感识别中缺乏描述性标注和专用框架的问题,并在多种任务中展现出优异性能。
English: The authors introduce AffectGPT and a new dataset to advance multimodal emotion recognition by addressing the lack of descriptive annotations and specialized frameworks, achieving strong performance across various tasks.

Authors:Robert O'Shea, Bipin Rajendran
Title: Closed-Form Feedback-Free Learning with Forward Projection
Abstract:
State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This novel randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Target values for pre-activation membrane potentials are generated layer-wise via nonlinear projections of pre-synaptic inputs and the labels. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets. In few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving significant speed up for training. Interpretation functions defined on local neuronal activity in FP-based models successfully identified clinically salient features for diagnosis in two biomedical datasets. Forward Projection is a computationally efficient machine learning approach that yields interpretable neural network models without retrograde communication of neuronal activity during training.
中文: 前向投影是一种无需反向传播的新型训练方法,通过单次前向传播和闭式回归构建可解释神经网络,在保持竞争力的泛化能力同时显著提升计算效率。
English: Forward Projection is a novel, backpropagation-free training method that uses single-pass forward propagation and closed-form regression to create interpretable neural networks, achieving competitive generalization with significant computational efficiency.
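
The closed-form regression step mentioned in the abstract can be illustrated with ordinary ridge regression, where weights are obtained by solving a linear system rather than by gradient descent. This is only a sketch of that one step under stated assumptions: the target values are taken as given (the abstract's nonlinear projection of inputs and labels is not reproduced), and the helper names `solve` and `ridge_fit` and the regulariser `lam` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for k in range(col, n + 1):
                    M[r][k] -= f * M[col][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def ridge_fit(X, y, lam=1e-3):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    Xty = [sum(row[i] * t for row, t in zip(X, y)) for i in range(d)]
    return solve(XtX, Xty)


# Toy layer: inputs X and given per-sample targets y for one neuron's
# pre-activation; the fit needs a single pass over the data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
print(ridge_fit(X, y, lam=0.0))
```

With `lam=0.0` the toy data are fitted exactly by the weights 2 and 3, up to floating-point rounding.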

Authors:Simon Dahan, Gabriel Bénédict, Logan Z. J. Williams, Yourong Guo, Daniel Rueckert, Robert Leech, Emma C. Robinson
Title: SIM: Surface-based fMRI Analysis for Inter-Subject Multimodal Decoding from Movie-Watching Experiments
Abstract:
Current AI frameworks for brain decoding and encoding typically train and test models within the same datasets. This limits their utility for brain-computer interfaces (BCI) or neurofeedback, for which it would be useful to pool experiences across individuals to better simulate stimuli not sampled during training. A key obstacle to model generalisation is the degree of variability of inter-subject cortical organisation, which makes it difficult to align or compare cortical signals across participants. In this paper we address this through the use of surface vision transformers, which build a generalisable model of cortical functional dynamics by encoding the topography of cortical networks and their interactions as a moving image across a surface. This is then combined with tri-modal self-supervised contrastive (CLIP) alignment of audio, video, and fMRI modalities to enable the retrieval of visual and auditory stimuli from patterns of cortical activity (and vice-versa). We validate our approach on 7T task-fMRI data from 174 healthy participants engaged in the movie-watching experiment from the Human Connectome Project (HCP). Results show that it is possible to detect which movie clips an individual is watching purely from their brain activity, even for individuals and movies not seen during training. Further analysis of attention maps reveals that our model captures individual patterns of brain activity that reflect semantic and visual systems. This opens the door to future personalised simulations of brain function. Code & pre-trained models will be made available at https://github.com/metrics-lab/sim, processed data for training will be available upon request at https://gin.g-node.org/Sdahan30/sim.
中文摘要:现有脑解码AI模型难以跨个体泛化,本研究通过表面视觉变换器结合多模态对比学习,实现了从未见过的受试者和电影中仅凭大脑活动准确识别刺激内容。
English Summary: Current AI models for brain decoding struggle with generalization across individuals, but this study introduces a surface vision transformer combined with multimodal contrastive learning to enable accurate stimulus retrieval from brain activity, even for unseen participants and movies.
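
CLIP-style contrastive alignment, as referenced in the abstract, is usually implemented as a symmetric InfoNCE loss over paired embeddings. The sketch below shows the pairwise form for two modalities. How the three modalities are combined is not stated in the abstract; summing pairwise losses would be one option, and that is an assumption. The function name and the default `temperature` are also hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss where a[i] and b[i] form a matched pair."""
    a = [_normalize(v) for v in a]
    b = [_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits between every a[i] and every b[j]
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Direction a -> b: each row should pick out its diagonal entry
    loss_ab = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction b -> a: the same over columns
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_ba = -sum(_log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


# Toy embeddings: e.g. fMRI features vs. video features for two clips
fmri = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[1.0, 0.0], [0.0, 1.0]]
video_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(fmri, video_aligned) < clip_loss(fmri, video_swapped))
```

At retrieval time, the same similarity matrix can be used to rank candidate stimuli for a given pattern of brain activity.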

Authors:George Wright, Slawomir Michniewski, Eleanor Jameson, Fayyaz ul Amir Afsar Minhas
Title: DepoRanker: A Web Tool to predict Klebsiella Depolymerases using Machine Learning
Abstract:
Background: Phage therapy shows promise for treating antibiotic-resistant Klebsiella infections. Identifying phage depolymerases that target Klebsiella capsular polysaccharides is crucial, as these capsules contribute to biofilm formation and virulence. However, homology-based searches have limitations in novel depolymerase discovery. Objective: To develop a machine learning model for identifying and ranking potential phage depolymerases targeting Klebsiella. Methods: We developed DepoRanker, a machine learning algorithm to rank proteins by their likelihood of being depolymerases. The model was experimentally validated on 5 newly characterized proteins and compared to BLAST. Results: DepoRanker demonstrated superior performance to BLAST in identifying potential depolymerases. Experimental validation confirmed its predictive ability on novel proteins. Conclusions: DepoRanker provides an accurate and functional tool to expedite depolymerase discovery for phage therapy against Klebsiella. It is available as a webserver and open-source software. Availability: Webserver: https://deporanker.dcs.warwick.ac.uk/ Source code: https://github.com/wgrgwrght/deporanker
Chinese: 研究人员开发了DepoRanker机器学习工具,在识别靶向克雷伯菌的噬菌体解聚酶方面优于BLAST方法,为针对耐药感染的噬菌体治疗加速了解聚酶的发现进程。
English: Researchers developed DepoRanker, a machine learning tool that outperforms BLAST in identifying phage depolymerases targeting Klebsiella, accelerating discovery for phage therapy against antibiotic-resistant infections.

Authors:Yash Yardi, Samuel Biruduganti, Lars Ankile
Title: Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer
Abstract:
Simulation offers a scalable and efficient alternative to real-world data collection for learning visuomotor robotic policies. However, the simulation-to-reality, or Sim2Real distribution shift -- introduced by employing simulation-trained policies in real-world environments -- frequently prevents successful policy transfer. We present an offline framework to evaluate the performance of using large-scale pre-trained vision encoders to address the Sim2Real gap. We examine a diverse collection of encoders, assessing their ability to extract features necessary for robot control (Action Score) while remaining invariant to task-irrelevant environmental variations (Domain Invariance Score). Evaluating 23 encoders, we reveal patterns across architectures, pre-training datasets, and parameter scales. Our findings show that manipulation-pretrained encoders consistently achieve higher Action Scores, CNN-based encoders demonstrate stronger domain invariance than ViTs, and the best-performing models combine both properties, underscoring DIS and AS as complementary predictors of Sim2Real transferability.
中文: 本研究评估了大规模预训练视觉编码器以弥合Sim2Real差距,发现操作预训练模型在动作相关性上表现优异,CNN比ViT具有更强的域不变性,而最佳模型综合了这两种特性以实现有效的策略迁移。
English: This study evaluates large-scale pre-trained vision encoders to bridge the Sim2Real gap, finding that manipulation-pretrained models excel in action relevance, CNNs offer superior domain invariance over ViTs, and top performers combine both traits for effective policy transfer.

Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Title: Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
Abstract:
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
中文: 本文探讨了将低秩适配器与神经架构搜索相结合的方法,以高效微调和压缩大型语言模型,使其在资源受限环境中实现内存减少和推理加速的实际部署。
English: This paper explores combining low-rank adapters with Neural Architecture Search to efficiently fine-tune and compress Large Language Models, enabling their practical deployment in resource-limited settings with reduced memory and faster inference.
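
As background for the low-rank adapters discussed in the abstract, a LoRA-style layer adds a trainable low-rank update to a frozen weight matrix. The following is a minimal sketch of the forward pass, y = Wx + (alpha / r) * B(Ax). The function names and the `alpha` default are assumptions for illustration, and the NAS or super-network machinery described in the paper is not shown.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen weight W (d_out x d_in) plus low-rank update B @ A.

    A has shape (r x d_in) and B has shape (d_out x r), with r << d_in.
    """
    r = len(A)                          # adapter rank
    base = matvec(W, x)                 # frozen pre-trained path
    update = matvec(B, matvec(A, x))    # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weights (identity for the toy case)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[1.0], [0.0]]             # rank-1 up-projection
print(lora_forward(W, A, B, [1.0, 2.0]))
```

Only `A` and `B` would be trained in this setup, which is why the number of trainable parameters scales with the rank rather than with the full weight matrix.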

Authors:Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun
Title: LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
Abstract:
The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Building on recent advances in end-to-end (E2E) speech systems and moving one step toward a more sophisticated audio agent, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.
中文:电影《她》中的AI萨曼莎能理解语言和情感线索,启发了LUCY这一端到端语音模型的开发,该模型在情感响应、自然对话生成及调用外部工具处理实时查询方面表现卓越。
English: The film "Her" depicts Samantha, an advanced AI that comprehends both linguistic and emotional cues in speech, inspiring the development of LUCY, an end-to-end speech model that excels in emotional responsiveness, natural dialogue generation, and utilizing external tools for real-time inquiries.

Authors:Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
Title: Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba
中文: Mixture-of-Mamba通过模态感知稀疏性改进了状态空间模型,在文本、图像和语音的多模态预训练中,能以显著降低的计算成本实现同等性能。
English: Mixture-of-Mamba introduces modality-aware sparsity into State Space Models, enabling efficient multi-modal pretraining by achieving comparable performance with significantly reduced computational costs across text, image, and speech tasks.
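
The idea of modality-specific parameterization can be illustrated by routing each token through a projection chosen by its modality tag, so that every token uses only its own modality's weights. This sketch shows the routing alone, under the assumption that a deterministic modality label is available per token. It does not implement the Mamba block or the state-space recurrence, and the function name and weight layout are hypothetical.

```python
def modality_aware_projection(tokens, modalities, weights):
    """Project each token with the weight matrix of its own modality.

    tokens:     list of feature vectors
    modalities: list of modality tags, one per token (e.g. "text", "image")
    weights:    dict mapping a modality tag to a (d_out x d_in) matrix
    """
    out = []
    for x, m in zip(tokens, modalities):
        W = weights[m]  # only this modality's parameters are used
        out.append([sum(w * xi for w, xi in zip(row, x)) for row in W])
    return out


weights = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "image": [[2.0, 0.0], [0.0, 2.0]],
}
tokens = [[1.0, 2.0], [3.0, 4.0]]
print(modality_aware_projection(tokens, ["text", "image"], weights))
```

Because routing depends only on the modality tag, no learned router is needed in this sketch, and the per-token compute matches that of a single dense projection.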

Authors:Younggun Kim, Mohamed Abdel-Aty, Beomsik Cho, Seonghoon Ryoo, Soomok Lee
Title: MSCN: Multi-view Structural Convolution Network for Domain-Invariant Point Cloud Recognition of Autonomous Vehicles
Abstract:
Although LiDAR sensors have become indispensable for autonomous vehicles (AVs) due to their ability to provide accurate 3D scene understanding and robust perception under adverse weather conditions, the properties of LiDAR point clouds vary widely across sensor configurations and data acquisition domains, leading to severe performance degradation when models are transferred between heterogeneous sensors or from simulation to the real world. To address this challenge, we propose the Multi-view Structural Convolution Network (MSCN), a novel architecture designed to achieve domain-invariant recognition across diverse LiDAR configurations and environments. MSCN comprises Structural Convolution Layers (SCL) that extract local context geometric features from point clouds and Structural Aggregation Layers (SAL) that extract and aggregate both local and overall context features from point clouds. Furthermore, we incorporate an unseen domain generation strategy to mitigate domain gaps during training. Extensive experiments demonstrate that MSCN consistently outperforms state-of-the-art point cloud classification methods across all domain change scenarios. These results highlight MSCN as a scalable solution for deploying LiDAR-based perception systems of AVs. Our code is available at https://github.com/MLMLab/MSCN.
中文摘要:提出的多视角结构卷积网络(MSCN)通过特殊结构层和领域生成策略有效解决激光雷达领域适应问题,在不同传感器配置和环境中均展现出卓越性能。
English Summary: The proposed Multi-view Structural Convolution Network (MSCN) effectively addresses LiDAR domain adaptation challenges through specialized structural layers and domain generation strategies, demonstrating superior performance across varied sensor configurations and environments.

Authors:Jacopo Di Ventura, Dylan R. Ashley, Vincent Herrmann, Francesco Faccio, Jürgen Schmidhuber
Title: Upside Down Reinforcement Learning with Policy Generators
Abstract:
Upside Down Reinforcement Learning (UDRL) is a promising framework for solving reinforcement learning problems which focuses on learning command-conditioned policies. In this work, we extend UDRL to the task of learning a command-conditioned generator of deep neural network policies. We accomplish this using Hypernetworks - a variant of Fast Weight Programmers, which learn to decode input commands representing a desired expected return into command-specific weight matrices. Our method, dubbed Upside Down Reinforcement Learning with Policy Generators (UDRLPG), streamlines comparable techniques by removing the need for an evaluator or critic to update the weights of the generator. To counteract the increased variance in last returns caused by not having an evaluator, we decouple the sampling probability of the buffer from the absolute number of policies in it, which, together with a simple weighting strategy, improves the empirical convergence of the algorithm. Compared with existing algorithms, UDRLPG achieves competitive performance and high returns, sometimes outperforming more complex architectures. Our experiments show that a trained generator can generalize to create policies that achieve unseen returns zero-shot. The proposed method appears to be effective in mitigating some of the challenges associated with learning highly multimodal functions. Altogether, we believe that UDRLPG represents a promising step forward in achieving greater empirical sample efficiency in RL. A full implementation of UDRLPG is publicly available at https://github.com/JacopoD/udrlpg_
Chinese: UDRLPG通过使用超网络扩展了倒置强化学习框架,能够根据指令生成特定神经网络策略,无需评估器即可实现优异性能,并通过解耦采样和加权策略提高了算法收敛性。
English: UDRLPG extends Upside Down Reinforcement Learning by employing Hypernetworks to generate command-specific neural network policies, eliminating the need for an evaluator and achieving competitive performance with improved convergence through decoupled sampling and weighting strategies.

Authors:Li Pang, Jing Yao, Kaiyu Li, Xiangyong Cao
Title: SPECIAL: Zero-shot Hyperspectral Image Classification With CLIP
Abstract:
Hyperspectral image (HSI) classification aims at categorizing each pixel in an HSI into a specific land cover class, which is crucial for applications like remote sensing, environmental monitoring, and agriculture. Although deep learning-based HSI classification methods have achieved significant advancements, existing methods still rely on manually labeled data for training, which is both time-consuming and labor-intensive. To address this limitation, we introduce a novel zero-shot hyperspectral image classification framework based on CLIP (SPECIAL), aiming to eliminate the need for manual annotations. The SPECIAL framework consists of two main stages: (1) CLIP-based pseudo-label generation, and (2) noisy label learning. In the first stage, HSI is spectrally interpolated to produce RGB bands. These bands are subsequently classified using CLIP, resulting in noisy pseudo-labels that are accompanied by confidence scores. To improve the quality of these labels, we propose a scaling strategy that fuses predictions from multiple spatial scales. In the second stage, spectral information and a label refinement technique are incorporated to mitigate label noise and further enhance classification accuracy. Experimental results on three benchmark datasets demonstrate that our SPECIAL outperforms existing methods in zero-shot HSI classification, showing its potential for more practical applications. The code is available at https://github.com/LiPang/SPECIAL.
中文: 本文提出SPECIAL零样本高光谱图像分类框架,通过CLIP生成伪标签并结合多尺度融合与噪声处理技术消除人工标注需求,在基准数据集上实现了优越性能。
English: This paper introduces SPECIAL, a zero-shot hyperspectral image classification framework that eliminates manual annotations by generating CLIP-based pseudo-labels and refining them through multi-scale fusion and noise reduction techniques, achieving superior performance on benchmark datasets.

Authors:Tatiana Taís Schein, Gustavo Pereira de Almeira, Stephanie Loi Brião, Rodrigo Andrade de Bem, Felipe Gomes de Oliveira, Paulo L. J. Drews-Jr
Title: UDBE: Unsupervised Diffusion-based Brightness Enhancement in Underwater Images
Abstract:
Activities in underwater environments are paramount in several scenarios, which drives the continuous development of underwater image enhancement techniques. A major challenge in this domain is the depth at which images are captured, with increasing depth resulting in a darker environment. Most existing methods for underwater image enhancement focus on noise removal and color adjustment, with few works dedicated to brightness enhancement. This work introduces a novel unsupervised learning approach to underwater image enhancement using a diffusion model. Our method, called UDBE, is based on conditional diffusion to maintain the brightness details of the unpaired input images. The input image is combined with a color map and a Signal-Noise Relation map (SNR) to ensure stable training and prevent color distortion in the output images. The results demonstrate that our approach achieves an impressive accuracy rate in the datasets UIEB, SUIM and RUIE, well-established underwater image benchmarks. Additionally, the experiments validate the robustness of our approach, regarding the image quality metrics PSNR, SSIM, UIQM, and UISM, indicating the good performance of the brightness enhancement process. The source code is available here: https://github.com/gusanagy/UDBE.
中文摘要:本文提出了一种名为UDBE的无监督扩散模型,用于水下图像增强,能有效提升亮度同时保留细节并防止色彩失真,在多个基准数据集上表现出优异的准确性和鲁棒性。
English Summary: This paper presents UDBE, an unsupervised diffusion model for underwater image enhancement that effectively improves brightness while preserving details and preventing color distortion, achieving high accuracy and robustness on benchmark datasets.

Authors:Wenxuan Xie, Fanpu Cao
Title: SWIFT: Mapping Sub-series with Wavelet Decomposition Improves Time Series Forecasting
Abstract:
In recent work on time-series prediction, Transformers and even large language models have garnered significant attention due to their strong capabilities in sequence modeling. However, in practical deployments, time-series prediction often requires operation in resource-constrained environments, such as edge devices, which are unable to handle the computational overhead of large models. To address such scenarios, some lightweight models have been proposed, but they exhibit poor performance on non-stationary sequences. In this paper, we propose SWIFT, a lightweight model that is not only powerful, but also efficient in deployment and inference for Long-term Time Series Forecasting (LTSF). Our model is based on three key points: (i) utilizing wavelet transform to perform lossless downsampling of time series; (ii) achieving cross-band information fusion with a learnable filter; (iii) using only one shared linear layer or one shallow MLP for sub-series' mapping. We conduct comprehensive experiments, and the results show that SWIFT achieves state-of-the-art (SOTA) performance on multiple datasets, offering a promising method for edge computing and deployment in this task. Moreover, it is noteworthy that the number of parameters in SWIFT-Linear is only 25% of what it would be with a single-layer linear model for time-domain prediction. Our code is available at https://github.com/LancelotXWX/SWIFT.
中文: 提出的轻量级模型SWIFT通过小波变换和跨频段融合技术,在长期时间序列预测中实现了最先进的性能,同时仅需极少计算资源,非常适合边缘设备部署。
English: The proposed lightweight model SWIFT achieves state-of-the-art performance for long-term time series forecasting by employing wavelet transform and cross-band fusion while requiring only minimal computational resources, making it ideal for edge deployment.
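
To illustrate point (i) of the abstract, the sketch below uses a single-level Haar transform. The choice of the Haar wavelet is an assumption, since the abstract does not say which wavelet SWIFT uses, and the function names are hypothetical. The transform splits a series into two half-length sub-series and reconstructs the original exactly, which is the sense in which the downsampling is lossless.

```python
import math


def haar_dwt(series):
    """Single-level Haar transform: returns (approximation, detail), each half length."""
    if len(series) % 2 != 0:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    approx = [(series[i] + series[i + 1]) / s for i in range(0, len(series), 2)]
    detail = [(series[i] - series[i + 1]) / s for i in range(0, len(series), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleaves the reconstructed even and odd samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


x = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0]
approx, detail = haar_dwt(x)
reconstructed = haar_idwt(approx, detail)
```

If one shared linear layer maps a half-length sub-series to a half-length forecast (an assumption about the layout), its weight count is (L/2)(H/2) = LH/4 for lookback L and horizon H, which would be consistent with the 25% figure quoted in the abstract.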

Authors:Anh-Kiet Duong, Petra Gomez-Krämer
Title: Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge
Abstract:
This paper presents a novel approach for hazard analysis in dashcam footage, addressing the detection of driver reactions to hazards, the identification of hazardous objects, and the generation of descriptive captions. We first introduce a method for detecting driver reactions through speed and sound anomaly detection, leveraging unsupervised learning techniques. For hazard detection, we employ a set of heuristic rules as weak classifiers, which are combined using an ensemble method. This ensemble approach is further refined with differential privacy to mitigate overconfidence, ensuring robustness despite the lack of labeled data. Lastly, we use state-of-the-art vision-language models for hazard captioning, generating descriptive labels for the detected hazards. Our method achieved the highest scores in the Challenge on Out-of-Label in Autonomous Driving, demonstrating its effectiveness across all three tasks. Source codes are publicly available at https://github.com/ffyyytt/COOOL_2025.
中文: 本文提出了一种新颖的行车记录仪危险分析方法,通过速度和声音异常检测驾驶反应,采用集成启发式规则结合差分隐私识别危险物体,并利用先进视觉语言模型生成描述性标签,在COOOL 2025挑战赛中取得最佳成绩。
English: This paper introduces a novel method for analyzing hazards in dashcam footage by detecting driver reactions through speed and sound anomalies, identifying hazardous objects using an ensemble of heuristic rules enhanced with differential privacy, and generating descriptive captions with vision-language models, achieving top performance in the COOOL 2025 challenge.

Authors:Zhibo Ren, Pritthijit Nath, Pancham Shukla
Title: Improving Tropical Cyclone Forecasting With Video Diffusion Models
Abstract:
Tropical cyclone (TC) forecasting is crucial for disaster preparedness and mitigation. While recent deep learning approaches have shown promise, existing methods often treat TC evolution as a series of independent frame-to-frame predictions, limiting their ability to capture long-term dynamics. We present a novel application of video diffusion models for TC forecasting that explicitly models temporal dependencies through additional temporal layers. Our approach enables the model to generate multiple frames simultaneously, better capturing cyclone evolution patterns. We introduce a two-stage training strategy that significantly improves individual-frame quality and performance in low-data regimes. Experimental results show our method outperforms the previous approach of Nath et al. by 19.3% in MAE, 16.2% in PSNR, and 36.1% in SSIM. Most notably, we extend the reliable forecasting horizon from 36 to 50 hours. Through comprehensive evaluation using both traditional metrics and Fréchet Video Distance (FVD), we demonstrate that our approach produces more temporally coherent forecasts while maintaining competitive single-frame quality. Code accessible at https://github.com/Ren-creater/forecast-video-diffmodels.
中文摘要:本研究提出了一种新型视频扩散模型用于热带气旋预报,通过增强的时间依赖建模和两阶段训练策略,显著提升了预测的时间连贯性,并将可靠预报时长延长至50小时。
English Summary: This study introduces a novel video diffusion model for tropical cyclone forecasting that improves temporal coherence and extends the reliable prediction horizon to 50 hours through enhanced temporal modeling and a two-stage training strategy.

Authors:Xiang Huang, Hao Peng, Shuo Sun, Zhifeng Hao, Hui Lin, Shuhai Wang
Title: Multi-View Attention Syntactic Enhanced Graph Convolutional Network for Aspect-based Sentiment Analysis
Abstract:
Aspect-based Sentiment Analysis (ABSA) is the task aimed at predicting the sentiment polarity of aspect words within sentences. Recently, incorporating graph neural networks (GNNs) to capture additional syntactic structure information in the dependency tree derived from syntactic dependency parsing has been proven to be an effective paradigm for boosting ABSA. Despite GNNs enhancing model capability by fusing more types of information, most works only utilize a single topology view of the dependency tree or simply conflate different perspectives of information without distinction, which limits the model performance. To address these challenges, in this paper, we propose a new multi-view attention syntactic enhanced graph convolutional network (MASGCN) that weighs different syntactic information of views using attention mechanisms. Specifically, we first construct distance mask matrices from the dependency tree to obtain multiple subgraph views for GNNs. To aggregate features from different views, we propose a multi-view attention mechanism to calculate the attention weights of views. Furthermore, to incorporate more syntactic information, we fuse the dependency type information matrix into the adjacency matrices and present a structural entropy loss to learn the dependency type adjacency matrix. Comprehensive experiments on four benchmark datasets demonstrate that our model outperforms state-of-the-art methods. The codes and datasets are available at https://github.com/SELGroup/MASGCN.
中文: 本文提出MASGCN模型,通过注意力机制融合依存树的多视角句法信息来提升方面级情感分析性能,在多个基准数据集上实现了最优表现。
English: This paper introduces MASGCN, a multi-view attention graph network that enhances aspect-based sentiment analysis by integrating diverse syntactic perspectives from dependency trees, outperforming existing methods on benchmark datasets.
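
The distance-mask step described in the abstract can be sketched as follows. This is an illustration under assumptions: the dependency tree is treated as undirected, and view k keeps token pairs whose hop distance is at most k. The thresholding rule and the function names are hypothetical, not taken from the MASGCN code.

```python
from collections import deque


def tree_distances(n, edges):
    """All-pairs hop distances in an undirected dependency tree with n tokens."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def distance_masks(dist, max_hops):
    """One 0/1 mask per view k = 1..max_hops; an entry is 1 when distance <= k."""
    return [
        [[1 if 0 <= d <= k else 0 for d in row] for row in dist]
        for k in range(1, max_hops + 1)
    ]


# A four-token sentence whose dependency tree is the chain 0-1-2-3.
dist = tree_distances(4, [(0, 1), (1, 2), (2, 3)])
masks = distance_masks(dist, 3)
```

Each mask defines one subgraph view; an attention mechanism over the views, as the abstract describes, would then weight the features computed on each one.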

Authors:Jiahao Chen, Bin Qin, Jiangmeng Li, Hao Chen, Bing Su
Title: Rethinking the Bias of Foundation Model under Long-tailed Distribution
Abstract:
Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances from pre-training affect long-tailed downstream tasks. Specifically, we characterize the imbalance biases that foundation models carry into downstream tasks as parameter imbalance and data imbalance. During fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, unlike data imbalance, parameter imbalance cannot be effectively addressed during training by current re-balancing techniques such as adjusting the logits. To tackle both imbalances simultaneously, we build our method on causal learning and view the incomplete semantic factor as the confounder, which brings spurious correlations between input samples and labels. To resolve the negative effects of this, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Notably, we achieve an average performance increase of about 1.67% on each dataset. Code is available at https://github.com/JiahaoChen1/Pre-train-Imbalance
中文摘要:本文研究了预训练数据不平衡对长尾下游任务的影响,识别出参数不平衡和数据不平衡是关键问题,并提出了一种基于因果学习的后门调整方法,在多个数据集上平均性能提升约1.67%。
English Summary: This paper investigates how imbalances in pre-training data affect long-tailed downstream tasks, identifying parameter and data imbalances as key issues, and proposes a novel causal learning-based backdoor adjustment method that achieves an average performance improvement of about 1.67% across datasets.

Authors:Chengting Yu, Xiaochen Zhao, Lei Liu, Shu Yang, Gaoang Wang, Erping Li, Aili Wang
Title: Efficient Logit-based Knowledge Distillation of Deep Spiking Neural Networks for Full-Range Timestep Deployment
Abstract:
Spiking Neural Networks (SNNs) are emerging as a brain-inspired alternative to traditional Artificial Neural Networks (ANNs), prized for their potential energy efficiency on neuromorphic hardware. Despite this, SNNs often suffer from accuracy degradation compared to ANNs and face deployment challenges due to fixed inference timesteps, which require retraining for adjustments, limiting operational flexibility. To address these issues, our work considers the spatio-temporal property inherent in SNNs, and proposes a novel distillation framework for deep SNNs that optimizes performance across full-range timesteps without specific retraining, enhancing both efficacy and deployment adaptability. We provide both theoretical analysis and empirical validations to illustrate that training guarantees the convergence of all implicit models across full-range timesteps. Experimental results on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate state-of-the-art performance among distillation-based SNNs training methods. Our code is available at https://github.com/Intelli-Chip-Lab/snn_temporal_decoupling_distillation.
中文: 本研究提出了一种新型的脉冲神经网络蒸馏框架,无需重新训练即可优化全时间步的性能,通过理论和实验验证在多个基准测试中取得了领先水平。
English: This work introduces a novel distillation framework for Spiking Neural Networks (SNNs) that optimizes performance across all timesteps without retraining, achieving state-of-the-art results on multiple benchmarks through theoretical and empirical validation.

Authors:Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu
Title: Parametric Retrieval Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) techniques have emerged as a promising solution to enhance the reliability of large language models (LLMs) by addressing issues like hallucinations, outdated knowledge, and domain adaptation. In particular, existing RAG methods append relevant documents retrieved from external corpus or databases to the input of LLMs to guide their generation process, which we refer to as the in-context knowledge injection method. While this approach is simple and often effective, it has inherent limitations. Firstly, increasing the context length and number of relevant documents can lead to higher computational overhead and degraded performance, especially in complex reasoning tasks. More importantly, in-context knowledge injection operates primarily at the input level, but LLMs store their internal knowledge in their parameters. This gap fundamentally limits the capacity of in-context methods. To this end, we introduce Parametric retrieval-augmented generation (Parametric RAG), a new RAG paradigm that integrates external knowledge directly into the parameters of feed-forward networks (FFN) of an LLM through document parameterization. This approach not only saves online computational costs by eliminating the need to inject multiple documents into the LLMs' input context, but also deepens the integration of external knowledge into the parametric knowledge space of the LLM. Experimental results demonstrate that Parametric RAG substantially enhances both the effectiveness and efficiency of knowledge augmentation in LLMs. Also, it can be combined with in-context RAG methods to achieve even better performance. We have open-sourced all the code, data, and models in the following anonymized GitHub link: https://github.com/oneal2000/PRAG
中文: 检索增强生成(RAG)技术通过解决幻觉和知识过时问题提升大语言模型可靠性,但现有方法存在计算效率与知识融合的局限;提出的参数化RAG通过将外部知识直接嵌入模型参数,显著提升了增强效果与效率。
English: Retrieval-augmented generation (RAG) techniques enhance LLM reliability by addressing hallucinations and outdated knowledge, but existing methods face limitations in computational efficiency and knowledge integration; the proposed Parametric RAG overcomes these by embedding external knowledge directly into model parameters, improving both effectiveness and efficiency.

Authors:Karahan Sarıtaş, Peter Dayan, Tingke Shen, Surabhi S Nath
Title: Complexity in Complexity: Understanding Visual Complexity Through Structure, Color, and Surprise
Abstract:
Understanding how humans perceive visual complexity is a key area of study in visual cognition. Previous approaches to modeling visual complexity assessments have often resulted in intricate, difficult-to-interpret algorithms that employ numerous features or sophisticated deep learning architectures. While these complex models achieve high performance on specific datasets, they often sacrifice interpretability, making it challenging to understand the factors driving human perception of complexity. Recently, Shen et al. (2024) proposed an interpretable segmentation-based model that accurately predicted complexity across various datasets, supporting the idea that complexity can be explained simply. In this work, we investigate the failure of their model to capture structural, color and surprisal contributions to complexity. To this end, we propose Multi-Scale Sobel Gradient (MSG) which measures spatial intensity variations, Multi-Scale Unique Color (MUC) which quantifies colorfulness across multiple scales, and surprise scores generated using a Large Language Model. We test our features on existing benchmarks and a novel dataset (Surprising Visual Genome) containing surprising images from Visual Genome. Our experiments demonstrate that modeling complexity accurately is not as simple as previously thought, requiring additional perceptual and semantic factors to address dataset biases. Our model improves predictive performance while maintaining interpretability, offering deeper insights into how visual complexity is perceived and assessed. Our code, analysis and data are available at https://github.com/Complexity-Project/Complexity-in-Complexity.
中文摘要:本研究通过引入多尺度梯度、色彩丰富度和语义惊奇度等特征,挑战了视觉复杂性可被简单解释的观点,在保持模型可解释性的同时提高了跨数据集的预测准确性。
English Summary: This study challenges the notion that visual complexity can be explained simply by introducing multi-scale gradient, colorfulness, and semantic surprise features, which improve prediction accuracy while maintaining model interpretability across diverse datasets.
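
A rough sketch of a multi-scale Sobel gradient score, to make the MSG idea concrete. The abstract only says that MSG measures spatial intensity variations across scales, so the details here are assumptions: grayscale input as a list of rows, 2x2 average pooling between scales, and a plain mean over scales. All function names are hypothetical.

```python
def sobel_mean_magnitude(img):
    """Mean Sobel gradient magnitude over the interior pixels of a grayscale image."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            total += (gx * gx + gy * gy) ** 0.5
            count += 1
    return total / count if count else 0.0


def downsample2(img):
    """Halve the resolution by averaging 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
          + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
         for x in range(w)]
        for y in range(h)
    ]


def multi_scale_sobel(img, scales=3):
    """Average the mean Sobel magnitude over progressively coarser copies of img."""
    scores = []
    for _ in range(scales):
        if len(img) < 3 or len(img[0]) < 3:
            break  # too small for a 3x3 Sobel kernel
        scores.append(sobel_mean_magnitude(img))
        img = downsample2(img)
    return sum(scores) / len(scores) if scores else 0.0


flat = [[0.5] * 8 for _ in range(8)]                # no structure
edge = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]    # one vertical step edge
flat_score = multi_scale_sobel(flat)
edge_score = multi_scale_sobel(edge)
```

A uniform image scores zero and any image with edges scores higher, which is the behavior a structural complexity feature needs.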

Authors:Kentaro Kurihara, Masato Mita, Peinan Zhang, Shota Sasaki, Ryosuke Ishigami, Naoaki Okazaki
Title: LCTG Bench: LLM Controlled Text Generation Benchmark
Abstract:
The rise of large language models (LLMs) has led to more diverse and higher-quality machine-generated text. However, their high expressive power makes it difficult to control outputs based on specific business instructions. In response, benchmarks focusing on the controllability of LLMs have been developed, but several issues remain: (1) They primarily cover major languages like English and Chinese, neglecting low-resource languages like Japanese; (2) Current benchmarks employ task-specific evaluation metrics, lacking a unified framework for selecting models based on controllability across different use cases. To address these challenges, this research introduces LCTG Bench, the first Japanese benchmark for evaluating the controllability of LLMs. LCTG Bench provides a unified framework for assessing control performance, enabling users to select the most suitable model for their use cases based on controllability. By evaluating nine diverse Japanese-specific and multilingual LLMs like GPT-4, we highlight the current state and challenges of controllability in Japanese LLMs and reveal the significant gap between multilingual models and Japanese-specific models.
中文:本研究推出了首个日语基准LCTG Bench,旨在解决低资源语言中大型语言模型可控性评估框架缺失的问题,揭示了日语专用模型与多语言模型之间的显著性能差距。
English: This research introduces LCTG Bench, the first Japanese benchmark addressing the lack of unified controllability evaluation for LLMs in low-resource languages, revealing a significant performance gap between Japanese-specific and multilingual models.

Authors:Moritz Mock, Thomas Borsani, Giuseppe Di Fatta, Barbara Russo
Title: Optimizing Deep Learning Models to Address Class Imbalance in Code Comment Classification
Abstract:
Developers rely on code comments to document their work, track issues, and understand the source code. As such, comments provide valuable insights into developers' understanding of their code and describe their various intentions in writing the surrounding code. Recent research leverages natural language processing and deep learning to classify comments based on developers' intentions. However, such labelled data are often imbalanced, causing learning models to perform poorly. This work investigates the use of different weighting strategies of the loss function to mitigate the scarcity of certain classes in the dataset. In particular, various RoBERTa-based transformer models are fine-tuned by means of a hyperparameter search to identify their optimal parameter configurations. Additionally, we fine-tuned the transformers with different weighting strategies for the loss function to address class imbalances. Our approach outperforms the STACC baseline by 8.9 per cent on the NLBSE'25 Tool Competition dataset in terms of the average F1_c score, and exceeds the baseline approach in 17 out of 19 cases, with a gain ranging from -5.0 to 38.2. The source code is publicly available at https://github.com/moritzmock/NLBSE2025.
中文:本研究通过优化损失函数权重微调RoBERTa模型来解决代码注释分类中的类别不平衡问题,相比基线方法在F1分数上提升了8.9%。
English: This study addresses class imbalance in code comment classification by fine-tuning RoBERTa models with optimized loss function weighting, achieving an 8.9% improvement in F1 score over baseline methods.
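
One common loss-weighting strategy is inverse class frequency, sketched below. The abstract does not say which weighting strategies were compared, so this choice is an assumption, and the function names are hypothetical. Rare classes receive larger weights, so their errors contribute more to the loss.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs is a list of dicts mapping class -> predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Nine samples of class "a", one of class "b"; the model always favors "a".
labels = ["a"] * 9 + ["b"]
probs = [{"a": 0.9, "b": 0.1} for _ in labels]

weights = inverse_frequency_weights(labels)
uniform = {"a": 1.0, "b": 1.0}
loss_weighted = weighted_cross_entropy(probs, labels, weights)
loss_uniform = weighted_cross_entropy(probs, labels, uniform)
```

With uniform weights the single misclassified minority sample is diluted by the nine easy majority samples; with inverse-frequency weights it dominates the loss.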

Authors:Ruiqi Wu, Na Su, Chenran Zhang, Tengfei Ma, Tao Zhou, Zhiting Cui, Nianfeng Tang, Tianyu Mao, Yi Zhou, Wen Fan, Tianxing Wu, Shenqi Jing, Huazhu Fu
Title: MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining
Abstract:
Vision-language pretraining (VLP) has been investigated to generalize across diverse downstream tasks for fundus image analysis. Although recent methods showcase promising achievements, they significantly rely on large-scale private image-text data but pay less attention to the pretraining manner, which limits their further advancements. In this work, we introduce MM-Retinal V2, a high-quality image-text paired dataset comprising CFP, FFA, and OCT image modalities. Then, we propose a novel fundus vision-language pretraining model, namely KeepFIT V2, which is pretrained by integrating knowledge from the elite data spark into categorical public datasets. Specifically, a preliminary textual pretraining is adopted to equip the text encoder with primarily ophthalmic textual knowledge. Moreover, a hybrid image-text knowledge injection module is designed for knowledge transfer, which is essentially based on a combination of global semantic concepts from contrastive learning and local appearance details from generative learning. Extensive experiments across zero-shot, few-shot, and linear probing settings highlight the generalization and transferability of KeepFIT V2, delivering performance competitive to state-of-the-art fundus VLP models trained on large-scale private image-text datasets. Our dataset and model are publicly available via https://github.com/lxirich/MM-Retinal.
中文: 本文提出MM-Retinal V2多模态视网膜数据集和KeepFIT V2视觉语言预训练模型,该模型通过混合学习方法整合眼科专业知识,在不依赖大规模私有数据集的情况下显著提升了眼底图像分析性能。
English: This paper introduces MM-Retinal V2, a multimodal retinal dataset, and KeepFIT V2, a novel vision-language pretraining model that enhances fundus image analysis by integrating specialized ophthalmic knowledge through hybrid learning methods, achieving competitive performance without relying on large private datasets.

Authors:Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
Title: Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs
Abstract:
Knowledge graphs are widely used in industrial applications, making error detection crucial for ensuring the reliability of downstream applications. Existing error detection methods often fail to effectively utilize fine-grained subgraph information and rely solely on fixed graph structures, while also lacking transparency in their decision-making processes, which results in suboptimal detection performance. In this paper, we propose a novel Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that utilizes multiple large language models (LLMs) in a collaborative setting. By concatenating fine-grained, bidirectional subgraph embeddings with LLM-based query embeddings during training, our framework integrates these representations to produce four specialized agents. These agents utilize subgraph information from different dimensions to engage in multi-round discussions, thereby improving error detection accuracy and ensuring a transparent decision-making process. Extensive experiments on FB15K and WN18RR demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the accuracy and robustness of KG evaluation. For specific industrial scenarios, our framework can facilitate the training of specialized agents using domain-specific knowledge graphs for error detection, which highlights the potential industrial application value of our framework. Our code and datasets are available at https://github.com/kse-ElEvEn/MAKGED.
中文摘要:MAKGED框架通过多智能体协同和大语言模型,利用细粒度子图信息进行多轮透明决策,显著提升了知识图谱错误检测的准确性和可解释性,在基准测试中优于现有方法。
English Summary: The MAKGED framework introduces a multi-agent system using large language models to enhance knowledge graph error detection by leveraging fine-grained subgraph information and enabling transparent multi-round discussions, achieving superior accuracy on benchmark datasets.

Authors:Edoardo Cetin, Tianyu Zhao, Yujin Tang
Title: Large Language Models to Diffusion Finetuning
Abstract:
We propose a new finetuning method to provide pre-trained large language models (LMs) the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is universally applicable to any foundation model pre-trained with a cross-entropy loss and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method is more effective and fully compatible with traditional finetuning approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.
中文: 本研究提出一种微调方法,使预训练大语言模型能够通过扩散框架扩展测试时计算量,在不改变原始模型权重的前提下提升准确性和任务表现。
English: This study introduces a finetuning method that enables pre-trained large language models to scale test-time compute using the diffusion framework, enhancing accuracy and task performance without altering original model weights.

Authors:Karam Park, Jae Woong Soh, Nam Ik Cho
Title: Efficient Attention-Sharing Information Distillation Transformer for Lightweight Single Image Super-Resolution
Abstract:
Transformer-based Super-Resolution (SR) methods have demonstrated superior performance compared to convolutional neural network (CNN)-based SR approaches due to their capability to capture long-range dependencies. However, their high computational complexity necessitates the development of lightweight approaches for practical use. To address this challenge, we propose the Attention-Sharing Information Distillation (ASID) network, a lightweight SR network that integrates attention-sharing and an information distillation structure specifically designed for Transformer-based SR methods. We modify the information distillation scheme, originally designed for efficient CNN operations, to reduce the computational load of stacked self-attention layers, effectively addressing the efficiency bottleneck. Additionally, we introduce attention-sharing across blocks to further minimize the computational cost of self-attention operations. By combining these strategies, ASID achieves competitive performance with existing SR methods while requiring only around 300K parameters - significantly fewer than existing CNN-based and Transformer-based SR models. Furthermore, ASID outperforms state-of-the-art SR methods when the number of parameters is matched, demonstrating its efficiency and effectiveness. The code and supplementary material are available on the project page.
中文摘要:提出的注意力共享信息蒸馏(ASID)网络通过整合注意力共享和改进的信息蒸馏结构,有效解决了基于Transformer的超分辨率方法计算复杂度高的问题,仅用约30万参数即实现了与现有方法相媲美的性能。
English Summary: The proposed Attention-Sharing Information Distillation (ASID) network addresses the high computational complexity of Transformer-based super-resolution methods by integrating attention-sharing and modified information distillation, achieving competitive performance with only around 300K parameters.

Authors:Muhammad Maaz, Timothy C. Y. Chan
Title: Formal Verification of Markov Processes with Learned Parameters
Abstract:
We introduce the problem of formally verifying properties of Markov processes where the parameters are given by the output of machine learning models. For a broad class of machine learning models, including linear models, tree-based models, and neural networks, verifying properties of Markov chains like reachability, hitting time, and total reward can be formulated as a bilinear program. We develop a decomposition and bound propagation scheme for solving the bilinear program and show through computational experiments that our method solves the problem to global optimality up to 100x faster than state-of-the-art solvers. To demonstrate the practical utility of our approach, we apply it to a real-world healthcare case study. Along with the paper, we release markovml, an open-source tool for building Markov processes, integrating pretrained machine learning models, and verifying their properties, available at https://github.com/mmaaz-git/markovml.
Chinese: 本研究提出了一种方法,用于形式化验证由机器学习模型输出参数的马可夫过程属性,将验证问题构建为双线性规划并以比现有求解器快100倍的速度求解,通过医疗案例展示了实用性并发布了开源工具。
English: This study presents a method for formally verifying properties of Markov processes with parameters derived from machine learning models, formulating the verification as a bilinear program and solving it up to 100 times faster than existing solvers, with practical application demonstrated in a healthcare case study and an open-source tool released.
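For readers unfamiliar with the reachability property named in the abstract, here is a minimal sketch of computing it for a Markov chain whose transition probabilities are already fixed numbers. It is not the paper's bilinear-program method and it does not use the markovml API. The function name `reachability`, the three-state chain, and its probabilities are hypothetical, chosen only for illustration.

```python
def reachability(P, targets, iters=10_000, tol=1e-12):
    """Probability of eventually reaching any state in `targets` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    Uses plain fixed-point iteration starting from zero on non-target states,
    which converges to the least fixed point (the reachability probability).
    """
    n = len(P)
    x = [1.0 if s in targets else 0.0 for s in range(n)]
    for _ in range(iters):
        new = [
            1.0 if s in targets else sum(P[s][t] * x[t] for t in range(n))
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x


# Hypothetical chain: state 0 is transient, states 1 and 2 are absorbing.
P = [
    [0.5, 0.3, 0.2],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
probs = reachability(P, targets={1})
```

In the setting the abstract describes, the entries of `P` would instead be outputs of a machine learning model, and verification would bound this quantity over a whole set of model inputs rather than evaluate it at a single point.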

Authors:Jialun Cai, Mengyuan Liu, Hong Liu, Wenhao Li, Shuheng Zhou
Title: NanoHTNet: Nano Human Topology Network for Efficient 3D Human Pose Estimation
Abstract:
The widespread application of 3D human pose estimation (HPE) is limited by resource-constrained edge devices, requiring more efficient models. A key approach to enhancing efficiency involves designing networks based on the structural characteristics of input data. However, effectively utilizing the structural priors in human skeletal inputs remains challenging. To address this, we leverage both explicit and implicit spatio-temporal priors of the human body through innovative model design and a pre-training proxy task. First, we propose a Nano Human Topology Network (NanoHTNet), a tiny 3D HPE network with stacked Hierarchical Mixers to capture explicit features. Specifically, the spatial Hierarchical Mixer efficiently learns the human physical topology across multiple semantic levels, while the temporal Hierarchical Mixer with discrete cosine transform and low-pass filtering captures local instantaneous movements and global action coherence. Moreover, Efficient Temporal-Spatial Tokenization (ETST) is introduced to enhance spatio-temporal interaction and reduce computational complexity significantly. Second, PoseCLR is proposed as a general pre-training method based on contrastive learning for 3D HPE, aimed at extracting implicit representations of human topology. By aligning 2D poses from diverse viewpoints in the proxy task, PoseCLR aids 3D HPE encoders like NanoHTNet in more effectively capturing the high-dimensional features of the human body, leading to further performance improvements. Extensive experiments verify that NanoHTNet with PoseCLR outperforms other state-of-the-art methods in efficiency, making it ideal for deployment on edge devices like the Jetson Nano. Code and models are available at https://github.com/vefalun/NanoHTNet.
Chinese: 针对3D人体姿态估计在边缘设备上的应用限制,我们开发了NanoHTNet紧凑网络,通过分层混合器和ETST捕捉显式时空特征,并提出了PoseCLR预训练方法以增强隐式拓扑学习,经广泛实验验证实现了卓越的效率和性能。
English: To address the limitations of 3D human pose estimation on edge devices, we developed NanoHTNet, a compact network that captures explicit spatio-temporal features through hierarchical mixers and ETST, and introduced PoseCLR, a pre-training method that enhances implicit topology learning, achieving superior efficiency and performance validated by extensive experiments.

Authors:Ashim Dahal, Saydul Akbar Murad, Nick Rahimi
Title: Efficiency Bottlenecks of Convolutional Kolmogorov-Arnold Networks: A Comprehensive Scrutiny with ImageNet, AlexNet, LeNet and Tabular Classification
Abstract:
Algorithmic-level developments such as Convolutional Neural Networks, transformers, attention mechanisms, and Retrieval-Augmented Generation have changed Artificial Intelligence. A recent development of this kind is Kolmogorov-Arnold Networks, which challenge the fundamental concept of a neural network and thereby propose to change Multilayer Perceptrons and Convolutional Neural Networks. They were well received for scientific modeling, yet have some drawbacks in terms of efficiency. In this paper, we train Convolutional Kolmogorov-Arnold Networks (CKANs) on the ImageNet-1k dataset with 1.3 million images, the MNIST dataset with 60k images, and a tabular biological-science MoA dataset, and test the promise of CKANs in terms of FLOPS, inference time, number of trainable parameters, and training time against the accuracy, precision, recall, and F1 score they produce, compared with standard industry practice on CNN models. We show that CKANs perform fairly well, yet more slowly than CNNs, on small datasets like MoA and MNIST, but are not nearly comparable as the dataset gets larger and more complex, as with ImageNet. The code implementation of this paper can be found at: https://github.com/ashimdahal/Study-of-Convolutional-Kolmogorov-Arnold-networks
中文摘要:本文评估了卷积柯尔莫哥洛夫-阿诺德网络(CKANs)与标准卷积神经网络的对比,结果表明CKANs在MNIST和MoA等小型数据集上表现尚可,但在ImageNet等大型复杂数据集上的效率和性能明显不足。
English Summary: This paper evaluates Convolutional Kolmogorov-Arnold Networks (CKANs) against standard CNNs, demonstrating that while CKANs perform fairly on smaller datasets like MNIST and MoA, they lag significantly in efficiency and performance on larger, complex datasets such as ImageNet.

Authors:Tianfu Wang, Yi Zhan, Jianxun Lian, Zhengyu Hu, Nicholas Jing Yuan, Qi Zhang, Xing Xie, Hui Xiong
Title: LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System
Abstract:
Intelligent Tutoring Systems (ITSs) have revolutionized education by offering personalized learning experiences. However, as goal-oriented learning, which emphasizes efficiently achieving specific objectives, becomes increasingly important in professional contexts, existing ITSs often struggle to deliver this type of targeted learning experience. In this paper, we propose GenMentor, an LLM-powered multi-agent framework designed to deliver goal-oriented, personalized learning within ITS. GenMentor begins by accurately mapping learners' goals to required skills using a fine-tuned LLM trained on a custom goal-to-skill dataset. After identifying the skill gap, it schedules an efficient learning path using an evolving optimization approach, driven by a comprehensive and dynamic profile of learners' multifaceted status. Additionally, GenMentor tailors learning content with an exploration-drafting-integration mechanism to align with individual learner needs. Extensive automated and human evaluations demonstrate GenMentor's effectiveness in learning guidance and content quality. Furthermore, we have deployed it in practice and also implemented it as an application. Practical human study with professional learners further highlights its effectiveness in goal alignment and resource targeting, leading to enhanced personalization. Supplementary resources are available at https://github.com/GeminiLight/gen-mentor.
Chinese: GenMentor 是一种基于大语言模型的多智能体框架,通过精准技能映射、优化学习路径和个性化内容定制,为智能辅导系统提供目标导向的个性化学习,其有效性已在自动化评估和实际应用中得到了验证。
English: GenMentor is an innovative LLM-powered multi-agent framework that enhances Intelligent Tutoring Systems by providing goal-oriented, personalized learning through precise skill mapping, optimized learning paths, and tailored content, proven effective in both automated evaluations and practical applications.

Authors:Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Title: TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
Abstract:
The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
Chinese: 本研究提出一种新颖框架,通过多头注意力张量化和塔克分解对权重进行结构化去噪与压缩,无需额外数据或训练即可实现高达250倍的压缩,并显著提升大语言模型的推理能力。
English: This study introduces a novel framework that enhances Large Language Models' reasoning by structurally denoising and compressing Multi-head Attention weights through tensorization and Tucker decomposition, achieving up to 250x compression and improved performance without extra data or training.

Authors:Ayush Gupta, Rama Chellappa
Title: MimicGait: A Model Agnostic approach for Occluded Gait Recognition using Correlational Knowledge Distillation
Abstract:
Gait recognition is an important biometric technique over large distances. State-of-the-art gait recognition systems perform very well in controlled environments at close range. Recently, there has been an increased interest in gait recognition in the wild prompted by the collection of outdoor, more challenging datasets containing variations in terms of illumination, pitch angles, and distances. An important problem in these environments is that of occlusion, where the subject is partially blocked from camera view. While important, this problem has received little attention. Thus, we propose MimicGait, a model-agnostic approach for gait recognition in the presence of occlusions. We train the network using a multi-instance correlational distillation loss to capture both inter-sequence and intra-sequence correlations in the occluded gait patterns of a subject, utilizing an auxiliary Visibility Estimation Network to guide the training of the proposed mimic network. We demonstrate the effectiveness of our approach on challenging real-world datasets like GREW, Gait3D and BRIAR. We release the code at https://github.com/Ayush-00/mimicgait.
中文: MimicGait是一种模型无关的方法,通过多实例相关性蒸馏和可见性估计网络解决步态识别中的遮挡问题,并在真实数据集上验证了其有效性。
English: MimicGait is a model-agnostic approach that addresses occlusion challenges in gait recognition by using multi-instance correlational distillation and a visibility estimation network, demonstrating effectiveness on real-world datasets.

Authors:Vaclav Knapp, Matyas Bohacek
Title: Can Pose Transfer Models Generate Realistic Human Motion?
Abstract:
Recent pose-transfer methods aim to generate temporally consistent and fully controllable videos of human action where the motion from a reference video is reenacted by a new identity. We evaluate three state-of-the-art pose-transfer methods -- AnimateAnyone, MagicAnimate, and ExAvatar -- by generating videos with actions and identities outside the training distribution and conducting a participant study about the quality of these videos. In a controlled environment of 20 distinct human actions, we find that participants, presented with the pose-transferred videos, correctly identify the desired action only 42.92% of the time. Moreover, the participants find the actions in the generated videos consistent with the reference (source) videos only 36.46% of the time. These results vary by method: participants find the splatting-based ExAvatar more consistent and photorealistic than the diffusion-based AnimateAnyone and MagicAnimate.
Chinese: 一项针对三种姿态迁移方法的研究显示,参与者仅能识别42.92%的目标动作,对生成视频与参考视频动作一致性的认可度仅为36.46%,其中基于点云渲染的ExAvatar在真实感方面优于扩散模型方法。
English: A study evaluating three pose-transfer methods found that participants correctly identified target actions only 42.92% of the time and perceived motion consistency with reference videos in just 36.46% of cases, with ExAvatar outperforming diffusion-based alternatives in realism.

Authors:Yang Ji, Ying Sun, Yuting Zhang, Zhigaoyuan Wang, Yuanxin Zhuang, Zheng Gong, Dazhong Shen, Chuan Qin, Hengshu Zhu, Hui Xiong
Title: A Comprehensive Survey on Self-Interpretable Neural Networks
Abstract:
Neural networks have achieved remarkable success across various fields. However, the lack of interpretability limits their practical use, particularly in critical decision-making scenarios. Post-hoc interpretability, which provides explanations for pre-trained models, is often at risk of robustness and fidelity. This has inspired a rising interest in self-interpretable neural networks, which inherently reveal the prediction rationale through the model structures. Although there exist surveys on post-hoc interpretability, a comprehensive and systematic survey of self-interpretable neural networks is still missing. To address this gap, we first collect and review existing works on self-interpretable neural networks and provide a structured summary of their methodologies from five key perspectives: attribution-based, function-based, concept-based, prototype-based, and rule-based self-interpretation. We also present concrete, visualized examples of model explanations and discuss their applicability across diverse scenarios, including image, text, graph data, and deep reinforcement learning. Additionally, we summarize existing evaluation metrics for self-interpretability and identify open challenges in this field, offering insights for future research. To support ongoing developments, we present a publicly accessible resource to track advancements in this domain: https://github.com/yangji721/Awesome-Self-Interpretable-Neural-Network.
中文: 本综述系统梳理了自解释神经网络,从五种方法视角分类总结其应用,并评估不同场景下的适用性,同时指出现有挑战与未来研究方向。
English: This survey systematically reviews self-interpretable neural networks, categorizing them into five methodological approaches and evaluating their applications across various data types while identifying current challenges and future research directions.

Authors:Ali Khodabandeh Yalabadi, Mehdi Yazdani-Jahromi, Ozlem Ozmen Garibay
Title: BoKDiff: Best-of-K Diffusion Alignment for Target-Specific 3D Molecule Generation
Abstract:
Structure-based drug design (SBDD) leverages the 3D structure of biomolecular targets to guide the creation of new therapeutic agents. Recent advances in generative models, including diffusion models and geometric deep learning, have demonstrated promise in optimizing ligand generation. However, the scarcity of high-quality protein-ligand complex data and the inherent challenges in aligning generated ligands with target proteins limit the effectiveness of these methods. We propose BoKDiff, a novel framework that enhances ligand generation by combining multi-objective optimization and Best-of-K alignment methodologies. Built upon the DecompDiff model, BoKDiff generates diverse candidates and ranks them using a weighted evaluation of molecular properties such as QED, SA, and docking scores. To address alignment challenges, we introduce a method that relocates the center of mass of generated ligands to their docking poses, enabling accurate sub-component extraction. Additionally, we integrate a Best-of-N (BoN) sampling approach, which selects the optimal ligand from multiple generated candidates without requiring fine-tuning. BoN achieves exceptional results, with QED values exceeding 0.6, SA scores above 0.75, and a success rate surpassing 35%, demonstrating its efficiency and practicality. BoKDiff achieves state-of-the-art results on the CrossDocked2020 dataset, including a -8.58 average Vina docking score and a 26% success rate in molecule generation. This study is the first to apply Best-of-K alignment and Best-of-N sampling to SBDD, highlighting their potential to bridge generative modeling with practical drug discovery requirements. The code is provided at https://github.com/khodabandeh-ali/BoKDiff.git.
Chinese: BoKDiff是一种创新框架,它通过结合多目标优化与Best-of-K对齐及Best-of-N采样方法,显著提升了基于结构的药物设计效果,在配体生成中实现了最先进的性能,并展现出优异的分子属性优化效率。
English: BoKDiff is a novel framework that enhances structure-based drug design by combining multi-objective optimization with Best-of-K alignment and Best-of-N sampling, achieving state-of-the-art results in ligand generation and demonstrating high efficiency in molecular property optimization.
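The ranking step described in the abstract (score several generated candidates by a weighted combination of QED, SA, and docking score, then keep the best) can be sketched as follows. This is an illustrative sketch only, not the BoKDiff implementation. The `Candidate` class, the equal default weights, the `vina_scale` normalisation, and the candidate values are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    qed: float   # drug-likeness in [0, 1], higher is better
    sa: float    # normalised synthetic accessibility in [0, 1], higher is better
    vina: float  # docking score in kcal/mol, more negative is better


def score(c, w_qed=1.0, w_sa=1.0, w_dock=1.0, vina_scale=12.0):
    """Weighted sum of properties; weights and scaling are assumed, not from the paper."""
    # Map the docking score to roughly [0, 1] so the three terms are comparable.
    dock = min(max(-c.vina / vina_scale, 0.0), 1.0)
    return w_qed * c.qed + w_sa * c.sa + w_dock * dock


def best_of_n(candidates, **weights):
    """Return the single highest-scoring candidate (Best-of-N selection)."""
    return max(candidates, key=lambda c: score(c, **weights))


# Hypothetical candidates standing in for ligands sampled from a generative model.
cands = [
    Candidate("lig_a", qed=0.62, sa=0.70, vina=-7.1),
    Candidate("lig_b", qed=0.55, sa=0.78, vina=-9.0),
    Candidate("lig_c", qed=0.70, sa=0.60, vina=-6.0),
]
best = best_of_n(cands)
qed_only = best_of_n(cands, w_sa=0.0, w_dock=0.0)
```

Changing the weights changes which candidate wins, which is why a multi-objective setup has to fix the weighting before comparing methods.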

Authors:Jiajun Dong, Chengkun Wang, Wenzhao Zheng, Lei Chen, Jiwen Lu, Yansong Tang
Title: GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting
Abstract:
Effective image tokenization is crucial for both multi-modal understanding and generation tasks due to the necessity of the alignment with discrete text data. To this end, existing approaches utilize vector quantization (VQ) to project pixels onto a discrete codebook and reconstruct images from the discrete representation. However, compared with the continuous latent space, the limited discrete codebook space significantly restricts the representational ability of these image tokenizers. In this paper, we propose GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting as a solution. We first represent the encoded samples as multiple flexible featured 2D Gaussians characterized by positions, rotation angles, scaling factors, and feature coefficients. We adopt the standard quantization for the Gaussian features and then concatenate the quantization results with the other intrinsic Gaussian parameters before the corresponding splatting operation and the subsequent decoding module. In general, GaussianToken integrates the local influence of 2D Gaussian distribution into the discrete space and thus enhances the representation capability of the image tokenizer. Competitive reconstruction performances on CIFAR, Mini-ImageNet, and ImageNet-1K demonstrate the effectiveness of our framework. Our code is available at: https://github.com/ChrisDong-THU/GaussianToken.
中文摘要:GaussianToken通过将图像表示为灵活的二维高斯分布并量化其特征,有效提升了图像分词器的表示能力,克服了离散码本的局限性,在多个数据集上实现了优异的图像重建效果。
English Summary: GaussianToken enhances image tokenization by representing images with flexible 2D Gaussians and quantizing their features, overcoming the limitations of discrete codebooks and improving reconstruction performance across multiple datasets.

Authors:Chenglong Ma, Zilong Li, Yuanlin Li, Jing Han, Junping Zhang, Yi Zhang, Jiannan Liu, Hongming Shan
Title: Radiologist-in-the-Loop Self-Training for Generalizable CT Metal Artifact Reduction
Abstract:
Metal artifacts in computed tomography (CT) images can significantly degrade image quality and impede accurate diagnosis. Supervised metal artifact reduction (MAR) methods, trained using simulated datasets, often struggle to perform well on real clinical CT images due to a substantial domain gap. Although state-of-the-art semi-supervised methods use pseudo ground-truths generated by a prior network to mitigate this issue, their reliance on a fixed prior limits both the quality and quantity of these pseudo ground-truths, introducing confirmation bias and reducing clinical applicability. To address these limitations, we propose a novel Radiologist-In-the-loop SElf-training framework for MAR, termed RISE-MAR, which can integrate radiologists' feedback into the semi-supervised learning process, progressively improving the quality and quantity of pseudo ground-truths for enhanced generalization on real clinical CT images. For quality assurance, we introduce a clinical quality assessor model that emulates radiologist evaluations, effectively selecting high-quality pseudo ground-truths for semi-supervised training. For quantity assurance, our self-training framework iteratively generates additional high-quality pseudo ground-truths, expanding the clinical dataset and further improving model generalization. Extensive experimental results on multiple clinical datasets demonstrate the superior generalization performance of our RISE-MAR over state-of-the-art methods, advancing the development of MAR models for practical application. Code is available at https://github.com/Masaaki-75/rise-mar.
中文: 本文提出RISE-MAR框架,通过融入放射科医生反馈的自我训练机制,逐步提升伪真实标注的质量与数量,有效改善CT图像金属伪影消除效果,在真实临床数据上展现出优越的泛化性能。
English: This paper introduces RISE-MAR, a radiologist-in-the-loop self-training framework that enhances metal artifact reduction in CT images by progressively improving pseudo ground-truth quality and quantity through clinical feedback and iterative generation, demonstrating superior generalization on real clinical data.

Authors:Zeyu Gan, Yun Liao, Yong Liu
Title: Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
Abstract:
Test-time scaling, which is also often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code at https://github.com/ZyGan1999/Snowball-Errors-and-Probability.
中文: 测试时扩展(即慢思考)通过扩大搜索范围或增强推理能力来降低错误概率,从而提升大型语言模型的多步推理能力,其效果主要不依赖于特定框架。
English: Test-time scaling, or slow-thinking, improves multi-step reasoning in LLMs by mitigating error probability through expanding search scope or enhancing reasoning capacity, with efficacy not primarily dependent on specific frameworks.

Authors:Zhiyuan Fan, Weinong Wang, Xing Wu, Debing Zhang
Title: SedarEval: Automated Evaluation using Self-Adaptive Rubrics
Abstract:
The evaluation paradigm of LLM-as-judge gains popularity due to its significant reduction in human labor and time costs. This approach utilizes one or more large language models (LLMs) to assess the quality of outputs from other LLMs. However, existing methods rely on generic scoring rubrics that fail to consider the specificities of each question and its problem-solving process, compromising precision and stability in assessments. Inspired by human examination scoring processes, we propose a new evaluation paradigm based on self-adaptive rubrics. Specifically, we create detailed scoring rubrics for each question, capturing the primary and secondary criteria in a structured format of scoring and deduction points that mimic a human evaluator's analytical process. Building on this paradigm, we further develop a novel benchmark called SedarEval, which covers a range of domains including long-tail knowledge, mathematics, coding, and logical reasoning. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. To further streamline the evaluation, we train a specialized evaluator language model (evaluator LM) to supplant human graders. Using the same training data, our evaluator LM achieves a higher concordance rate with human grading results than other paradigms, including GPT-4, highlighting the superiority and efficiency of our approach. We release our dataset at https://github.com/wwn1233/sedareval.
Chinese: LLM作为评判者的评估范式虽降低了人工成本,但因通用评分标准缺乏精确性,故提出自适应评分标准和SedarEval基准,其专用评估语言模型在与人评分一致性上优于现有方法。
English: The LLM-as-judge evaluation paradigm reduces human effort but lacks precision due to generic rubrics, leading to the proposal of self-adaptive rubrics and the SedarEval benchmark with a specialized evaluator LM that outperforms existing methods in aligning with human grading.

Authors:Soheil Gharatappeh, Salimeh Yasaei Sekeh
Title: Information Consistent Pruning: How to Efficiently Search for Sparse Networks?
Abstract:
Iterative magnitude pruning methods (IMPs), proven to be successful in reducing the number of insignificant nodes in over-parameterized deep neural networks (DNNs), have been getting an enormous amount of attention with the rapid deployment of DNNs into cutting-edge technologies with computation and memory constraints. Despite the popularity of IMPs for pruning networks, a fundamental limitation of existing IMP algorithms is the significant training time required for each pruning iteration. Our paper introduces a novel stopping criterion for IMPs that monitors information and gradient flows between network layers and minimizes the training time. Information Consistent Pruning eliminates the need to retrain the network to its original performance during intermediate steps while maintaining overall performance at the end of the pruning process. Through our experiments, we demonstrate that our algorithm is more efficient than current IMPs across multiple dataset-DNN combinations. We also provide theoretical insights into the core idea of our algorithm alongside mathematical explanations of flow-based IMP. Our code is available at https://github.com/Sekeh-Lab/InfCoP.
中文: 本文提出信息一致性剪枝方法,通过监控网络层间的信息和梯度流来减少训练时间,在剪枝过程中无需完全重训练即可保持性能,显著提升了迭代幅度剪枝的效率。
English: The paper introduces Information Consistent Pruning, a novel stopping criterion for iterative magnitude pruning that reduces training time by monitoring information and gradient flows, maintaining performance without full retraining between steps.
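As background for the iterative magnitude pruning loop that the abstract builds on, the basic prune step can be sketched in a few lines. This shows generic magnitude pruning only. It does not implement the paper's information- and gradient-flow stopping criterion, and the retraining between rounds is omitted. The function name, the weight values, and the 50% pruning fraction are hypothetical.

```python
def magnitude_prune(weights, mask, fraction):
    """Deactivate the smallest-magnitude `fraction` of the still-active weights."""
    active = [i for i, m in enumerate(mask) if m]
    k = int(len(active) * fraction)
    # Indices of the k active weights with the smallest absolute value.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    new_mask = list(mask)
    for i in drop:
        new_mask[i] = 0
    return new_mask


# Hypothetical flat weight vector standing in for a network's parameters.
weights = [0.9, -0.05, 0.4, -0.7, 0.01, 0.2, -0.3, 0.6]
mask = [1] * len(weights)

for _ in range(2):
    # In a real IMP loop the network would be (re)trained here before pruning;
    # how long to train in each round is what a stopping criterion decides.
    mask = magnitude_prune(weights, mask, fraction=0.5)

pruned = [w * m for w, m in zip(weights, mask)]
```

Two rounds at 50% leave a quarter of the weights active, which illustrates why the per-round training cost dominates: every round normally includes a full retraining pass.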

Authors:Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, Yuan Qi
Title: SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain
Abstract:
Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at https://github.com/AQA6666/SCP-116K-open.
中文摘要:SCP-116K数据集通过自动化流程构建了11.6万组高质量科学问题解决方案,填补了高等教育STEM领域数据资源的空白,其开源发布将推动大语言模型的科学推理能力发展并降低先进模型复现门槛。
English Summary: The SCP-116K dataset introduces 116,756 high-quality scientific problem-solution pairs to address the scarcity of STEM resources at higher education levels, providing an open resource to advance LLM reasoning capabilities and replicate successes like the o1 model.

Authors:Romeo Sommerfeld, Christian Helms, Ralf Herbrich
Title: Approximate Message Passing for Bayesian Neural Networks
Abstract:
Bayesian neural networks (BNNs) offer the potential for reliable uncertainty quantification and interpretability, which are critical for trustworthy AI in high-stakes domains. However, existing methods often struggle with issues such as overconfidence, hyperparameter sensitivity, and posterior collapse, leaving room for alternative approaches. In this work, we advance message passing (MP) for BNNs and present a novel framework that models the predictive posterior as a factor graph. To the best of our knowledge, our framework is the first MP method that handles convolutional neural networks and avoids double-counting training data, a limitation of previous MP methods that causes overconfidence. We evaluate our approach on CIFAR-10 with a convolutional neural network of roughly 890k parameters and find that it can compete with the SOTA baselines AdamW and IVON, even having an edge in terms of calibration. On synthetic data, we validate the uncertainty estimates and observe a strong correlation (0.9) between posterior credible intervals and their probability of covering the true data-generating function outside the training range. While our method scales to an MLP with 5.6 million parameters, further improvements are necessary to match the scale and performance of state-of-the-art variational inference methods.
中文: 本研究为贝叶斯神经网络提出了一种新颖的消息传递框架,该框架能有效处理卷积网络并避免数据重复计算,在CIFAR-10数据集上展现出与先进基线相当的校准性能,并在合成数据上验证了可靠的不确定性量化能力,同时承认其在扩展性方面仍需改进。
English: This study introduces a novel message passing framework for Bayesian neural networks that effectively handles convolutional networks and prevents data double-counting, achieving competitive calibration on CIFAR-10 and demonstrating reliable uncertainty quantification on synthetic data while acknowledging scalability limitations compared to state-of-the-art methods.

Authors:Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
Abstract:
As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNN more expressive and demonstrates state tracking ability beyond transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
中文: 本研究基于纯原生RWKV-7注意力机制,推出了从Qwen 2.5蒸馏而来的系列模型,旨在提升RNN的表达能力和状态追踪性能以超越Transformer,同时展示了基于RWKV-6架构的QRWK 32B模型,仅用16个GPU在8小时内即可完成知识处理。
English: This research introduces a series of models derived from Qwen 2.5 using pure RWKV-7 attention, aiming to enhance RNN expressiveness and state tracking beyond transformers, while also presenting QRWK 32B based on RWKV-6 for efficient knowledge processing in just 8 hours with 16 GPUs.

Authors:Jiahang Tu, Qian Feng, Jiahua Dong, Hanbin Zhao, Chao Zhang, Nicu Sebe, Hui Qian
Title: CE-SDWV: Effective and Efficient Concept Erasure for Text-to-Image Diffusion Models via a Semantic-Driven Word Vocabulary
Abstract:
Large-scale text-to-image (T2I) diffusion models have achieved remarkable generative performance across a wide range of concepts. Given privacy and safety constraints in practice, the generative capability concerning NSFW (Not Safe For Work) concepts is undesirable, e.g., producing sexually explicit photos and licensed images. The concept erasure task for T2I diffusion models has attracted considerable attention and requires an effective and efficient method. To achieve this goal, we propose the CE-SDWV framework, which removes the target concepts (e.g., NSFW concepts) of T2I diffusion models in the text semantic space by only adjusting the text condition tokens and does not need to re-train the original T2I diffusion model's weights. Specifically, our framework first builds a target concept-related word vocabulary to enhance the representation of the target concepts within the text semantic space, and then utilizes an adaptive semantic component suppression strategy to ablate the target concept-related semantic information in the text condition tokens. To further adapt the above text condition tokens to the original image semantic space, we propose an end-to-end gradient-orthogonal token optimization strategy. Extensive experiments on I2P and UnlearnCanvas benchmarks demonstrate the effectiveness and efficiency of our method. Code is available at https://github.com/TtuHamg/CE-SDWV.
中文:CE-SDWV框架通过调整文本条件令牌在语义空间中消除文本到图像扩散模型中的不良概念(如NSFW内容),无需重新训练模型,实验证明其高效有效。
English: The CE-SDWV framework effectively removes undesirable concepts like NSFW content from text-to-image diffusion models by modifying text tokens in the semantic space without retraining the model, as validated through extensive experiments.

Authors:Jiadong Shi, Chunyu Duan, Hao Lei, Liangmin Wang
Title: Real-CATS: A Practical Training Ground for Emerging Research on Cryptocurrency Cybercrime Detection
Abstract:
Cybercriminals pose a significant threat to blockchain trading security, causing $40.9 billion in losses in 2024. However, the lack of an effective real-world address dataset hinders the advancement of cybercrime detection research. The anti-cybercrime efforts of researchers from broader fields, such as statistics and artificial intelligence, are blocked by data scarcity. In this paper, we present Real-CATS, a Real-world dataset of Cryptocurrency Addresses with Transaction profileS, serving as a practical training ground for developing and assessing detection methods. Real-CATS comprises 103,203 criminal addresses from real-world reports and 106,196 benign addresses from exchange customers. It satisfies the C3R characteristics (Comprehensiveness, Classifiability, Customizability, and Real-world Transferability), which are fundamental for practical detection of cryptocurrency cybercrime. The dataset provides three main functions: 1) effective evaluation of detection methods, 2) support for feature extensions, and 3) a new evaluation scenario for real-world deployment. Real-CATS also offers opportunities to expand cybercrime measurement studies. It is particularly beneficial for researchers without cryptocurrency-related knowledge to engage in this emerging research field. We hope that studies on cryptocurrency cybercrime detection will be promoted by an increasing number of cross-disciplinary researchers drawn to this versatile data platform. All datasets are available at https://github.com/sjdseu/Real-CATS
中文: 网络犯罪分子对区块链交易安全构成重大威胁,但数据稀缺阻碍了研究进展,而Real-CATS数据集通过提供全面的现实世界数据来推动检测方法发展并支持跨学科研究。
English: Cybercriminals caused massive losses in blockchain trading, but research is hindered by data scarcity, which the Real-CATS dataset addresses by providing comprehensive real-world data to advance detection methods and support cross-disciplinary studies.

Authors:Oubo Ma, Linkang Du, Yang Dai, Chunyi Zhou, Qingming Li, Yuwen Pu, Shouling Ji
Title: UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning
Abstract:
Deep reinforcement learning (DRL) is widely applied to safety-critical decision-making scenarios. However, DRL is vulnerable to backdoor attacks, especially action-level backdoors, which pose significant threats through precise manipulation and flexible activation, risking outcomes like vehicle collisions or drone crashes. The key distinction of action-level backdoors lies in the utilization of the backdoor reward function to associate triggers with target actions. Nevertheless, existing studies typically rely on backdoor reward functions with fixed values or conditional flipping, which lack universality across diverse DRL tasks and backdoor designs, resulting in fluctuations or even failure in practice. This paper proposes the first universal action-level backdoor attack framework, called UNIDOOR, which enables adaptive exploration of backdoor reward functions through performance monitoring, eliminating the reliance on expert knowledge and grid search. We highlight that action tampering serves as a crucial component of action-level backdoor attacks in continuous action scenarios, as it addresses attack failures caused by low-frequency target actions. Extensive evaluations demonstrate that UNIDOOR significantly enhances the attack performance of action-level backdoors, showcasing its universality across diverse attack scenarios, including single/multiple agents, single/multiple backdoors, discrete/continuous action spaces, and sparse/dense reward signals. Furthermore, visualization results encompassing state distribution, neuron activation, and animations demonstrate the stealthiness of UNIDOOR. The source code of UNIDOOR can be found at https://github.com/maoubo/UNIDOOR.
中文: 本文提出首个通用动作级后门攻击框架UNIDOOR,通过性能监控自适应探索后门奖励函数,在多种攻击场景下显著提升攻击性能并保持隐蔽性。
English: This paper introduces UNIDOOR, the first universal action-level backdoor attack framework for deep reinforcement learning, which adaptively explores backdoor reward functions through performance monitoring and demonstrates superior attack performance across diverse scenarios while maintaining stealthiness.

Authors:Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang
Title: TinyLLaVA-Video: Towards Smaller LMMs for Video Understanding with Group Resampler
Abstract:
Video behavior recognition and scene understanding are fundamental tasks in multimodal intelligence, serving as critical building blocks for numerous real-world applications. Though large multimodal models (LMMs) have achieved remarkable progress in video understanding, most existing open-source models rely on over 7B parameters and require large-scale datasets for training, making them resource-intensive and inaccessible to many researchers. Furthermore, lightweight models face persistent challenges in effectively processing long visual sequences and temporal understanding. In this work, we introduce TinyLLaVA-Video, a lightweight yet powerful video understanding model with approximately 3.6B parameters. The cornerstone of our design is the video-level group resampler, a novel mechanism that significantly reduces and controls the number of visual tokens at the video level. Unlike traditional image-level resamplers, our approach effectively mitigates redundancy while enhancing temporal comprehension, leading to improved performance on video-based tasks. In addition, TinyLLaVA-Video demonstrates exceptional efficiency, requiring only one day of training on 8 A100-40G GPUs. It surpasses several existing 7B-parameter models on multiple benchmarks. We believe this work provides a valuable foundation for future research on lightweight video understanding models. The code and weights are available at https://github.com/ZhangXJ199/TinyLLaVA-Video.
中文摘要:本文提出TinyLLaVA-Video这一轻量级3.6B参数模型,通过创新的视频级分组重采样器有效减少视觉标记冗余并增强时序理解能力,在仅需少量训练资源的情况下实现了超越多个7B参数模型的性能表现。
English Summary: This paper introduces TinyLLaVA-Video, a lightweight 3.6B-parameter model that uses a novel video-level group resampler to reduce visual token redundancy and enhance temporal understanding, achieving superior performance over larger models while requiring minimal training resources.

Authors:JiaKui Hu, Lujia Jin, Zhengjian Yao, Yanye Lu
Title: Universal Image Restoration Pre-training via Degradation Classification
Abstract:
This paper proposes the Degradation Classification Pre-Training (DCPT), which enables models to learn how to classify the degradation type of input images for universal image restoration pre-training. Unlike the existing self-supervised pre-training methods, DCPT utilizes the degradation type of the input image as an extremely weak supervision, which can be effortlessly obtained, even intrinsic in all image restoration datasets. DCPT comprises two primary stages. Initially, image features are extracted from the encoder. Subsequently, a lightweight decoder, such as ResNet18, is leveraged to classify the degradation type of the input image solely based on the features extracted in the first stage, without utilizing the input image. The encoder is pre-trained with a straightforward yet potent DCPT, which is used to address universal image restoration and achieve outstanding performance. Following DCPT, both convolutional neural networks (CNNs) and transformers demonstrate performance improvements, with gains of up to 2.55 dB in the 10D all-in-one restoration task and 6.53 dB in the mixed degradation scenarios. Moreover, previous self-supervised pretraining methods, such as masked image modeling, discard the decoder after pre-training, while our DCPT utilizes the pre-trained parameters more effectively. This superiority arises from the degradation classifier acquired during DCPT, which facilitates transfer learning between models of identical architecture trained on diverse degradation types. Source code and models are available at https://github.com/MILab-PKU/dcpt.
Chinese: 本文提出退化分类预训练(DCPT)方法,通过将图像退化类型分类作为弱监督信号进行通用图像恢复预训练,在多种网络架构和任务中均实现了显著的性能提升。
English: This paper introduces Degradation Classification Pre-Training (DCPT), a method that uses degradation type classification as weak supervision to pre-train models for universal image restoration, achieving significant performance gains across various architectures and tasks.

Authors:Dan Song, Shumeng Huo, Wenhui Li, Lanjun Wang, Chao Xue, An-An Liu
Title: Domain Adaptation from Generated Multi-Weather Images for Unsupervised Maritime Object Classification
Abstract:
The classification and recognition of maritime objects are crucial for enhancing maritime safety, monitoring, and intelligent sea environment prediction. However, existing unsupervised methods for maritime object classification often struggle with the long-tail data distributions in both object categories and weather conditions. In this paper, we construct a dataset named AIMO, produced by large-scale generative models with diverse weather conditions and balanced object categories, and collect a dataset named RMO with real-world images where the long-tail issue exists. We propose a novel domain adaptation approach that leverages AIMO (source domain) to address the problems of limited labeled data, unbalanced distribution, and domain shift in RMO (target domain), enhances the generalization of source features with vision-language models such as CLIP, and introduces a difficulty score for curriculum learning to optimize the training process. Experimental results show that the proposed method significantly improves the classification accuracy, particularly for samples within rare object categories and weather conditions. Datasets and codes will be publicly available at https://github.com/honoria0204/AIMO.
中文摘要:本文提出一种新颖的域自适应方法,利用生成数据集AIMO解决真实海域物体分类中的长尾分布和域偏移问题,通过CLIP增强特征和课程学习显著提升了稀有类别和恶劣天气条件下的分类准确率。
English Summary: This paper introduces a novel domain adaptation method using a generative dataset (AIMO) to address long-tail distribution and domain shift in real maritime object classification, significantly improving accuracy for rare categories and weather conditions through CLIP-enhanced features and curriculum learning.

Authors:Tong Lei, Kyle T. Rizzo, Brian N. Bailey
Title: PhoTorch: A robust and generalized biochemical photosynthesis model fitting package based on PyTorch
Abstract:
Advancements in artificial intelligence (AI) have greatly benefited plant phenotyping and predictive modeling. However, unrealized opportunities exist in leveraging AI advancements in model parameter optimization for parameter fitting in complex biophysical models. This work developed novel software, PhoTorch, for fitting parameters of the Farquhar, von Caemmerer, and Berry (FvCB) biochemical photosynthesis model based on the parameter optimization components of the popular AI framework PyTorch. The primary novelty of the software lies in its computational efficiency, robustness of parameter estimation, and flexibility in handling different types of response curves and sub-model functional forms. PhoTorch can fit both steady-state and non-steady-state gas exchange data with high efficiency and accuracy. Its flexibility allows for optional fitting of temperature and light response parameters, and it can simultaneously fit light response curves and standard A/Ci curves. These features are not available within presently available A/Ci curve fitting packages. Results illustrated the robustness and efficiency of PhoTorch in fitting A/Ci curves with high variability and some level of artifacts and noise. PhoTorch is more than four times faster than benchmark software, which may be relevant when processing many non-steady-state A/Ci curves with hundreds of data points per curve. PhoTorch provides researchers from various fields with a reliable and efficient tool for analyzing photosynthetic data. The Python package is openly accessible from the repository: https://github.com/GEMINI-Breeding/photorch.
Chinese: 本研究开发了PhoTorch软件,利用PyTorch的AI优化技术高效拟合FvCB光合作用模型参数,相比现有工具在处理稳态和非稳态气体交换数据时具有更强的灵活性和更快的速度。
English: This research introduces PhoTorch, a novel software that utilizes PyTorch's AI optimization for efficient and robust parameter fitting in the FvCB photosynthesis model, offering enhanced flexibility and speed over existing tools for analyzing both steady-state and non-steady-state gas exchange data.

Authors:Zhenkai Wu, Xiaowen Ma, Rongrong Lian, Kai Zheng, Mengting Ma, Wei Zhang, Siyang Song
Title: CD-Lamba: Boosting Remote Sensing Change Detection via a Cross-Temporal Locally Adaptive State Space Model
Abstract:
Mamba, with its advantages of global perception and linear complexity, has been widely applied to identify changes of the target regions within the remote sensing (RS) images captured under complex scenarios and varied conditions. However, existing remote sensing change detection (RSCD) approaches based on Mamba frequently struggle to effectively perceive the inherent locality of change regions because they directly flatten and scan RS images (i.e., the features of the same region of changes are not distributed continuously within the sequence but are mixed with features from other regions throughout the sequence). In this paper, we propose a novel locally adaptive SSM-based approach, termed CD-Lamba, which effectively enhances the locality of change detection while maintaining global perception. Specifically, our CD-Lamba includes a Locally Adaptive State-Space Scan (LASS) strategy for locality enhancement, a Cross-Temporal State-Space Scan (CTSS) strategy for bi-temporal feature fusion, and a Window Shifting and Perception (WSP) mechanism to enhance interactions across segmented windows. These strategies are integrated into a multi-scale Cross-Temporal Locally Adaptive State-Space Scan (CT-LASS) module to effectively highlight changes and refine the feature representations of changes. CD-Lamba significantly enhances local-global spatio-temporal interactions in bi-temporal images, offering improved performance in RSCD tasks. Extensive experimental results show that CD-Lamba achieves state-of-the-art performance on four benchmark datasets with a satisfactory efficiency-accuracy trade-off. Our code is publicly available at https://github.com/xwmaxwma/rschange.
中文: 基于Mamba的遥感变化检测方法因扁平化扫描难以有效感知局部变化区域,因此提出的CD-Lamba通过局部自适应和跨时序扫描策略增强时空交互,在基准数据集上实现了最优性能。
English: Mamba-based remote sensing change detection methods often fail to effectively capture local change regions due to flattened scanning, so the proposed CD-Lamba introduces locally adaptive and cross-temporal scanning strategies to enhance local-global spatio-temporal interactions, achieving state-of-the-art performance on benchmark datasets.

Authors:Zengran Wang, Yanan Zhang, Jiaxin Chen, Di Huang
Title: Breaking the SSL-AL Barrier: A Synergistic Semi-Supervised Active Learning Framework for 3D Object Detection
Abstract:
To address the annotation burden in LiDAR-based 3D object detection, active learning (AL) methods offer a promising solution. However, traditional active learning approaches solely rely on a small amount of labeled data to train an initial model for data selection, overlooking the potential of leveraging the abundance of unlabeled data. Recently, attempts to integrate semi-supervised learning (SSL) into AL with the goal of leveraging unlabeled data have faced challenges in effectively resolving the conflict between the two paradigms, resulting in less satisfactory performance. To tackle this conflict, we propose a Synergistic Semi-Supervised Active Learning framework, dubbed as S-SSAL. Specifically, from the perspective of SSL, we propose a Collaborative PseudoScene Pre-training (CPSP) method that effectively learns from unlabeled data without introducing adverse effects. From the perspective of AL, we design a Collaborative Active Learning (CAL) method, which complements the uncertainty and diversity methods by model cascading. This allows us to fully exploit the potential of the CPSP pre-trained model. Extensive experiments conducted on KITTI and Waymo demonstrate the effectiveness of our S-SSAL framework. Notably, on the KITTI dataset, utilizing only 2% labeled data, S-SSAL can achieve performance comparable to models trained on the full dataset. The code has been released at https://github.com/LandDreamer/S_SSAL.
Chinese: 为解决激光雷达3D物体检测中的标注负担,S-SSAL框架通过协同伪场景预训练和协同主动学习方法,将半监督学习与主动学习有机结合,在KITTI数据集上仅使用2%标注数据即可达到全监督模型的性能水平。
English: To address the annotation burden in LiDAR-based 3D object detection, the proposed S-SSAL framework synergistically integrates semi-supervised learning with active learning through Collaborative PseudoScene Pre-training and Collaborative Active Learning, achieving performance comparable to fully supervised models using only 2% labeled data on KITTI.

Authors:Jiaqi Li, Xueyao Zhang, Yuancheng Wang, Haorui He, Chaoren Wang, Li Wang, Huan Liao, Junyi Ao, Zeyu Xie, Yiqiao Huang, Junan Zhang, Zhizheng Wu
Title: Overview of the Amphion Toolkit (v0.2)
Abstract:
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, designed to lower the entry barrier for junior researchers and engineers in these fields. It provides a versatile framework that supports a variety of generation tasks and models. In this report, we introduce Amphion v0.2, the second major release developed in 2024. This release features a 100K-hour open-source multilingual dataset, a robust data preparation pipeline, and novel models for tasks such as text-to-speech, audio coding, and voice conversion. Furthermore, the report includes multiple tutorials that guide users through the functionalities and usage of the newly released models.
中文: Amphion v0.2 是2024年发布的第二代开源音频工具包,提供多功能生成框架,包含10万小时多语言数据集、强大数据处理流程及文本转语音等新型模型,并配有详细使用教程。
English: Amphion v0.2 is an open-source toolkit released in 2024 that provides a versatile framework for audio, music, and speech generation, featuring a 100K-hour multilingual dataset, robust data pipelines, and novel models for tasks like text-to-speech and voice conversion, along with comprehensive tutorials.

Authors:Han Wang, Rui Yang Tan, Roy Ka-Wei Lee
Title: Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection
Abstract:
Detecting hate speech in online content is essential to ensuring safer digital spaces. While significant progress has been made in text and meme modalities, video-based hate speech detection remains under-explored, hindered by a lack of annotated datasets and the high cost of video annotation. This gap is particularly problematic given the growing reliance on large models, which demand substantial amounts of training data. To address this challenge, we leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models. Our approach introduces a human-assisted reannotation pipeline to align meme dataset labels with video datasets, ensuring consistency with minimal labeling effort. Using two state-of-the-art vision-language models, we demonstrate that meme data can substitute for video data in resource-scarce scenarios and augment video datasets to achieve further performance gains. Our results consistently outperform state-of-the-art benchmarks, showcasing the potential of cross-modal transfer learning for advancing hateful video detection. Dataset and code are available at https://github.com/Social-AI-Studio/CrossModalTransferLearning.
中文摘要:本研究提出一种跨模态迁移学习方法,利用表情包数据集训练仇恨视频检测模型,通过人工辅助标签对齐和数据增强策略实现了最先进的性能表现。
English Summary: This study introduces a cross-modal transfer learning approach that utilizes meme datasets to train hateful video detection models, achieving state-of-the-art performance through human-assisted label alignment and data augmentation strategies.

Authors:Hossein Mirzaei, Mojtaba Nafez, Jafar Habibi, Mohammad Sabokrou, Mohammad Hossein Rohban
Title: Mitigating Spurious Negative Pairs for Robust Industrial Anomaly Detection
Abstract:
Despite significant progress in Anomaly Detection (AD), the robustness of existing detection methods against adversarial attacks remains a challenge, compromising their reliability in critical real-world applications such as autonomous driving. This issue primarily arises from the AD setup, which assumes that training data is limited to a group of unlabeled normal samples, making the detectors vulnerable to adversarial anomaly samples during testing. Additionally, implementing adversarial training as a safeguard encounters difficulties, such as formulating an effective objective function without access to labels. An ideal objective function for adversarial training in AD should promote strong perturbations both within and between the normal and anomaly groups to maximize the margin between the normal and anomaly distributions. To address these issues, we first propose crafting a pseudo-anomaly group derived from normal group samples. Then, we demonstrate that adversarial training with contrastive loss could serve as an ideal objective function, as it creates both inter- and intra-group perturbations. However, we notice that spurious negative pairs prevent the conventional contrastive loss from achieving robust AD. Spurious negative pairs are those that should be closely mapped but are erroneously separated. These pairs introduce noise and misguide the direction of inter-group adversarial perturbations. To overcome the effect of spurious negative pairs, we define opposite pairs and adversarially pull them apart to strengthen inter-group perturbations. Experimental results demonstrate our superior performance in both clean and adversarial scenarios, with a 26.1% improvement in robust detection across various challenging benchmark datasets. The implementation of our work is available at: https://github.com/rohban-lab/COBRA.
中文摘要:本研究针对异常检测系统在对抗攻击下的脆弱性问题,提出了一种基于对比损失和伪异常样本生成的新型对抗训练方法,通过在基准测试中实现26.1%的鲁棒性提升,有效增强了系统防御能力。
English Summary: This study addresses the vulnerability of anomaly detection systems to adversarial attacks by proposing a novel adversarial training method using contrastive loss and pseudo-anomaly generation, which significantly enhances robustness by 26.1% across various benchmarks.

Authors:Junrui Liu, Tong Li, Di Wu, Zifang Tang, Yuan Fang, Zhen Yang
Title: An Aspect Performance-aware Hypergraph Neural Network for Review-based Recommendation
Abstract:
Online reviews allow consumers to provide detailed feedback on various aspects of items. Existing methods utilize these aspects to model users' fine-grained preferences for specific item features through graph neural networks. We argue that the performance of items on different aspects is important for making precise recommendations, which has not been taken into account by existing approaches due to a lack of data. In this paper, we propose an aspect performance-aware hypergraph neural network (APH) for review-based recommendation, which learns the performance of items from the conflicting sentiment polarity of user reviews. Specifically, APH comprehensively models the relationships among users, items, aspects, and sentiment polarity by systematically constructing an aspect hypergraph based on user reviews. In addition, APH aggregates aspects representing users and items by employing an aspect performance-aware hypergraph aggregation method. It aggregates the sentiment polarities from multiple users by jointly considering user preferences and the semantics of their sentiments, determining the weights of sentiment polarities to infer the performance of items on various aspects. Such performances are then used as weights to aggregate neighboring aspects. Experiments on six real-world datasets demonstrate that APH improves MSE, Precision@5, and Recall@5 by an average of 2.30%, 4.89%, and 1.60% over the best baseline. The source code and data are available at https://github.com/dianziliu/APH.
中文摘要:本文提出APH模型,一种基于超图神经网络的方面性能感知推荐系统,通过分析用户评论中的矛盾情感来学习物品在不同方面的表现,从而提升推荐精度。
English Summary: The paper introduces APH, an aspect performance-aware hypergraph neural network that leverages conflicting review sentiments to model item performance across different aspects, enhancing recommendation accuracy by integrating user preferences and sentiment semantics.
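
The APH abstract reports gains in Precision@5 and Recall@5. As a reminder of what those metrics measure, here is a minimal sketch of the usual top-k definitions. The function name `precision_recall_at_k` and the toy item IDs are illustrative assumptions; the paper's exact evaluation protocol is not described in the abstract.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(ranked: List[str], relevant: Set[str], k: int = 5) -> Tuple[float, float]:
    """Common top-k definitions (assumed here, not taken from the APH code).

    precision@k = hits in the top k / k
    recall@k    = hits in the top k / number of relevant items
    """
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    # Convention chosen for this sketch: recall is 0 when nothing is relevant.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: a ranked recommendation list and the items the user actually liked.
ranked_items = ["a", "b", "c", "d", "e", "f"]
liked_items = {"b", "e", "z"}
p_at_5, r_at_5 = precision_recall_at_k(ranked_items, liked_items, k=5)
```

In this toy case two of the top five items are relevant, so precision counts hits against the list length `k` while recall counts them against the three liked items.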

Authors:Liang Shang, William A. Sethares, Anusha Adluru, Andrew L. Alexander, Vivek Prabhakaran, Veena A. Nair, Nagesh Adluru
Title: Stroke Lesion Segmentation using Multi-Stage Cross-Scale Attention
Abstract:
Precise characterization of stroke lesions from MRI data has immense value in prognosticating clinical and cognitive outcomes following a stroke. Manual stroke lesion segmentation is time-consuming and requires the expertise of neurologists and neuroradiologists. Often, lesions are grossly characterized for their location and overall extent using bounding boxes without specific delineation of their boundaries. While such characterization provides some clinical value, to develop a precise mechanistic understanding of the impact of lesions on post-stroke vascular contributions to cognitive impairments and dementia (VCID), the stroke lesions need to be fully segmented with accurate boundaries. This work introduces the Multi-Stage Cross-Scale Attention (MSCSA) mechanism, applied to the U-Net family, to improve the mapping between brain structural features and lesions of varying sizes. Using the Anatomical Tracings of Lesions After Stroke (ATLAS) v2.0 dataset, MSCSA outperforms all baseline methods in both Dice and F1 scores on a subset focusing on small lesions, while maintaining competitive performance across the entire dataset. Notably, the ensemble strategy incorporating MSCSA achieves the highest scores for Dice and F1 on both the full dataset and the small lesion subset. These results demonstrate the effectiveness of MSCSA in segmenting small lesions and highlight its robustness across different training schemes for large stroke lesions. Our code is available at: https://github.com/nadluru/StrokeLesSeg.
中文: 多阶段跨尺度注意力机制应用于U-Net架构,通过增强脑部结构特征与不同尺寸病灶的映射关系,在ATLAS v2.0数据集中对小病灶分割表现优异,并在整体数据集上保持稳健性能。
English: The Multi-Stage Cross-Scale Attention (MSCSA) mechanism enhances U-Net-based stroke lesion segmentation by effectively mapping brain structural features to lesions of varying sizes, achieving superior performance on small lesions and robust results across the ATLAS v2.0 dataset.
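
The MSCSA abstract evaluates segmentation with the Dice score. The sketch below shows the standard overlap-based Dice coefficient on flattened binary masks, Dice = 2|A∩B| / (|A| + |B|). The function name `dice_score` and the empty-mask convention are assumptions made for illustration, and the paper's F1 metric may be computed differently (for example per lesion), so it is not reproduced here.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient for two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy masks: 2 overlapping voxels, 3 predicted positives, 3 true positives.
pred_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
score = dice_score(pred_mask, true_mask)
```

Because the denominator is the total number of positive voxels, a small lesion that is missed entirely drives the score to 0, which is one reason results on small lesions are often reported separately.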

Authors:Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu
Title: Visual Generation Without Guidance
Abstract:
Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free. Code will be available at https://github.com/thu-ml/GFT.
Chinese: 提出的无引导训练(GFT)方法通过从头开始训练视觉模型,无需引导采样即可达到分类器自由引导的性能,同时在多个生成领域中保持相似的多样性-保真度权衡,并将计算成本减半。
English: The proposed Guidance-Free Training (GFT) method eliminates the need for guided sampling by training visual models directly from scratch, matching the performance of Classifier-Free Guidance while halving computational costs and maintaining similar diversity-fidelity trade-offs across multiple generative domains.
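
For context on the cost that GFT aims to remove, the sketch below illustrates the standard Classifier-Free Guidance combination step, not GFT itself. It uses one common convention, `uncond + scale * (cond - uncond)`, and a hypothetical toy model; `cfg_prediction`, `toy_model`, and `counting_model` are names invented for this example.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical model signature: (noisy sample, optional class condition) -> prediction.
Model = Callable[[Vector, Optional[int]], Vector]


def cfg_prediction(model: Model, x: Vector, cond: int, scale: float) -> Vector:
    """Combine conditional and unconditional predictions, as CFG does at sampling time."""
    cond_out = model(x, cond)    # first forward pass (conditional)
    uncond_out = model(x, None)  # second forward pass (unconditional)
    return [u + scale * (c - u) for c, u in zip(cond_out, uncond_out)]


def toy_model(x: Vector, cond: Optional[int]) -> Vector:
    """Stand-in for a network: shifts the input by the class index when conditioned."""
    shift = 0.0 if cond is None else float(cond)
    return [v + shift for v in x]


calls: List[Optional[int]] = []


def counting_model(x: Vector, cond: Optional[int]) -> Vector:
    """Wrapper that records every forward pass so the cost is visible."""
    calls.append(cond)
    return toy_model(x, cond)


guided = cfg_prediction(counting_model, [1.0, 2.0], cond=1, scale=2.0)
```

Each guided step records two entries in `calls`, one conditional and one unconditional. That doubled evaluation is what the abstract refers to when it says GFT reduces sampling to a single model.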

Authors:Siqi Fan, Yuguang Xie, Bowen Cai, Ailin Xie, Gaochao Liu, Mu Qiao, Jie Xing, Zaiqing Nie
Title: OCSU: Optical Chemical Structure Understanding for Molecule-centric Scientific Discovery
Abstract:
Understanding the chemical structure from a graphical representation of a molecule is a challenging image caption task that would greatly benefit molecule-centric scientific discovery. Variations in molecular images and caption subtasks pose a significant challenge in both image representation learning and task modeling. Yet, existing methods only focus on a specific caption task that translates a molecular image into its graph structure, i.e., OCSR. In this paper, we propose the Optical Chemical Structure Understanding (OCSU) task, which extends low-level recognition to multilevel understanding and aims to translate chemical structure diagrams into strings readable by both machines and chemists. To facilitate the development of OCSU technology, we explore both OCSR-based and OCSR-free paradigms. We propose DoubleCheck to enhance OCSR performance via attentive feature enhancement for local ambiguous atoms. It can be cascaded with existing SMILES-based molecule understanding methods to achieve OCSU. Meanwhile, Mol-VL is a vision-language model end-to-end optimized for OCSU. We also construct Vis-CheBI20, the first large-scale OCSU dataset. Through comprehensive experiments, we demonstrate that the proposed approaches excel at providing chemist-readable captions for chemical structure diagrams, providing solid baselines for further research. Our code, model, and data are open-sourced at https://github.com/PharMolix/OCSU.
中文摘要:本文提出光学化学结构理解(OCSU)任务,将分子图像转化为机器可读且化学家可理解的描述,通过两种创新方法和新构建的数据集显著提升了分子图像解析性能。
English Summary: This paper introduces the Optical Chemical Structure Understanding (OCSU) task to translate molecular diagrams into machine-readable and chemist-friendly captions, proposing two innovative methods and a new dataset that outperform existing approaches.

Authors:Zhiming Wang, Lin Gu, Feng Lu
Title: TdAttenMix: Top-Down Attention Guided Mixup
Abstract:
CutMix is a data augmentation strategy that cuts and pastes image patches to mix up training data. Existing methods pick either random or salient areas, which are often inconsistent with the labels, thus misguiding model training. To our knowledge, we are the first to integrate human gaze to guide CutMix. Since human attention is driven by both high-level recognition and low-level cues, we propose a controllable Top-down Attention Guided Module to obtain a general artificial attention that balances top-down and bottom-up attention. The proposed TdAttenMix then picks the patches and adjusts the label mixing ratio to focus on regions relevant to the current label. Experimental results demonstrate that our TdAttenMix outperforms existing state-of-the-art mixup methods across eight different benchmarks. Additionally, we introduce a new metric based on human gaze and use this metric to investigate the issue of image-label inconsistency. Project page: https://github.com/morning12138/TdAttenMix
中文: TdAttenMix是一种新型CutMix数据增强方法,首次通过自上而下的注意力模块整合人类注视信息,选择与标签一致的图像区域并调整混合比例,在八个基准测试中实现最优性能,同时解决了图像标签不一致问题。
English: TdAttenMix is a novel CutMix data augmentation method that integrates human gaze through a top-down attention module to select label-consistent image patches and adjust mixing ratios, achieving superior performance across eight benchmarks while addressing image-label inconsistency.
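
As a point of reference for the abstract above, the following is a minimal sketch of plain CutMix, not of TdAttenMix itself. It assumes images are 2-D Python lists and that the patch box is given by the caller; in TdAttenMix the box and the mixing ratio would instead be driven by the attention module, which is not reproduced here. The function names `cutmix` and `mix_labels` are hypothetical.

```python
def cutmix(img_a, img_b, box):
    """Paste the region `box` of img_b into a copy of img_a.

    img_a, img_b: equally sized 2-D lists (rows of pixel values).
    box: (top, left, height, width) of the pasted patch.
    Returns the mixed image and lam, the fraction of pixels kept from img_a.
    """
    rows, cols = len(img_a), len(img_a[0])
    top, left, height, width = box
    # Clip the box to the image bounds.
    bottom, right = min(top + height, rows), min(left + width, cols)
    mixed = [row[:] for row in img_a]
    for r in range(top, bottom):
        mixed[r][left:right] = img_b[r][left:right]
    pasted = max(bottom - top, 0) * max(right - left, 0)
    lam = 1.0 - pasted / (rows * cols)
    return mixed, lam


def mix_labels(label_a, label_b, lam):
    """Area-proportional label mixing used by plain CutMix."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


if __name__ == "__main__":
    a = [[0] * 4 for _ in range(4)]
    b = [[1] * 4 for _ in range(4)]
    mixed, lam = cutmix(a, b, (0, 0, 2, 2))
    print(lam, mix_labels([1.0, 0.0], [0.0, 1.0], lam))
```

In this baseline the label ratio depends only on patch area, which is the source of the image-label inconsistency the abstract describes: a pasted patch may contain no label-relevant content at all.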

Authors:Guanglin Niu, Xiaowei Zhang
Title: Diffusion-based Hierarchical Negative Sampling for Multimodal Knowledge Graph Completion
Abstract:
Multimodal Knowledge Graph Completion (MMKGC) aims to address the critical issue of missing knowledge in multimodal knowledge graphs (MMKGs) so that they can be applied more effectively. However, both the previous MMKGC and negative sampling (NS) approaches ignore the employment of multimodal information to generate diverse and high-quality negative triples from various semantic levels and hardness levels, thereby limiting the effectiveness of training MMKGC models. Thus, we propose a novel Diffusion-based Hierarchical Negative Sampling (DHNS) scheme tailored for MMKGC tasks, which tackles the challenge of generating high-quality negative triples by leveraging a Diffusion-based Hierarchical Embedding Generation (DiffHEG) that progressively conditions on entities and relations as well as multimodal semantics. Furthermore, we develop a Negative Triple-Adaptive Training (NTAT) strategy that dynamically adjusts training margins associated with the hardness level of the synthesized negative triples, facilitating a more robust and effective learning procedure to distinguish between positive and negative triples. Extensive experiments on three MMKGC benchmark datasets demonstrate that our framework outperforms several state-of-the-art MMKGC models and negative sampling techniques, illustrating the effectiveness of our DHNS for training MMKGC models. The source codes and datasets of this paper are available at https://github.com/ngl567/DHNS.
Chinese: 本研究针对多模态知识图谱补全任务,提出了一种基于扩散的分层负采样框架,通过利用多模态信息生成高质量负三元组并结合自适应训练策略,显著提升了模型性能,在多个基准数据集上验证了其有效性。
English: This study introduces a Diffusion-based Hierarchical Negative Sampling (DHNS) framework for Multimodal Knowledge Graph Completion, which generates high-quality negative triples using multimodal information and adaptive training to enhance model performance, as validated by superior results on benchmark datasets.

Authors:Hao Shu, Jicheng Li, Yu Jin, Hailin Wang
Title: Guaranteed Multidimensional Time Series Prediction via Deterministic Tensor Completion Theory
Abstract:
In recent years, the prediction of multidimensional time series data has become increasingly important due to its wide-ranging applications. Tensor-based prediction methods have gained attention for their ability to preserve the inherent structure of such data. However, existing approaches, such as tensor autoregression and tensor decomposition, have consistently failed to provide clear assertions regarding the number of samples that can be exactly predicted. While matrix-based methods using nuclear norms address this limitation, their reliance on matrices limits accuracy and increases computational costs when handling multidimensional data. To overcome these challenges, we reformulate multidimensional time series prediction as a deterministic tensor completion problem and propose a novel theoretical framework. Specifically, we develop a deterministic tensor completion theory and introduce the Temporal Convolutional Tensor Nuclear Norm (TCTNN) model. By convolving the multidimensional time series along the temporal dimension and applying the tensor nuclear norm, our approach identifies the maximum forecast horizon for exact predictions. Additionally, TCTNN achieves superior performance in prediction accuracy and computational efficiency compared to existing methods across diverse real-world datasets, including climate temperature, network flow, and traffic ride data. Our implementation is publicly available at https://github.com/HaoShu2000/TCTNN.
中文摘要:本研究提出了一种新颖的张量补全框架,通过时序卷积张量核范数(TCTNN)确定精确预测范围,并在多维时间序列预测中实现了更优的精度和计算效率。
English Summary: This study introduces a novel tensor completion framework using Temporal Convolutional Tensor Nuclear Norm (TCTNN) to determine exact prediction horizons and achieve superior accuracy and efficiency in multidimensional time series forecasting.

Authors:Long Yang, Lianqing Zheng, Wenjin Ai, Minghao Liu, Sen Li, Qunshu Lin, Shengyu Yan, Jie Bai, Zhixiong Ma, Tao Huang, Xichan Zhu
Title: MetaOcc: Spatio-Temporal Fusion of Surround-View 4D Radar and Camera for 3D Occupancy Prediction with Dual Training Strategies
Abstract:
Robust 3D occupancy prediction is essential for autonomous driving, particularly under adverse weather conditions where traditional vision-only systems struggle. While the fusion of surround-view 4D radar and cameras offers a promising low-cost solution, effectively extracting and integrating features from these heterogeneous sensors remains challenging. This paper introduces MetaOcc, a novel multi-modal framework for omnidirectional 3D occupancy prediction that leverages both multi-view 4D radar and images. To address the limitations of directly applying LiDAR-oriented encoders to sparse radar data, we propose a Radar Height Self-Attention module that enhances vertical spatial reasoning and feature extraction. Additionally, a Hierarchical Multi-scale Multi-modal Fusion strategy is developed to perform adaptive local-global fusion across modalities and time, mitigating spatio-temporal misalignments and enriching fused feature representations. To reduce reliance on expensive point cloud annotations, we further propose a pseudo-label generation pipeline based on an open-set segmentor. This enables a semi-supervised strategy that achieves 90% of the fully supervised performance using only 50% of the ground truth labels, offering an effective trade-off between annotation cost and accuracy. Extensive experiments demonstrate that MetaOcc under full supervision achieves state-of-the-art performance, outperforming previous methods by +0.47 SC IoU and +4.02 mIoU on the OmniHD-Scenes dataset, and by +1.16 SC IoU and +1.24 mIoU on the SurroundOcc-nuScenes dataset. These results demonstrate the scalability and robustness of MetaOcc across sensor domains and training conditions, paving the way for practical deployment in real-world autonomous systems. Code and data are available at https://github.com/LucasYang567/MetaOcc.
中文摘要:本文提出MetaOcc多模态框架,通过融合4D雷达与相机数据实现自动驾驶中的全方位3D占据预测,采用创新的特征提取与分层融合策略,在减少标注成本的同时达到最优性能表现。
English Summary: This paper introduces MetaOcc, a multi-modal framework that combines 4D radar and camera data for robust 3D occupancy prediction in autonomous driving, achieving state-of-the-art performance through novel feature extraction and fusion techniques while reducing annotation costs via semi-supervised learning.

Authors:Chuanyang Zheng
Title: iFormer: Integrating ConvNet and Transformer for Mobile Application
Abstract:
We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13, surpassing the recently proposed MobileNetV4 under similar latency constraints. Additionally, our method shows significant improvements in downstream tasks, including COCO object detection, instance segmentation, and ADE20k semantic segmentation, while still maintaining low latency on mobile devices for high-resolution inputs in these scenarios.
中文: iFormer是一种新型移动混合视觉网络,通过结合卷积的局部处理能力和自注意力的全局建模优势,在移动设备上实现了卓越的精度与低延迟,在ImageNet分类和COCO目标检测等任务中超越了现有模型。
English: iFormer is a novel mobile hybrid vision network that integrates convolution's local processing with self-attention's global modeling to achieve superior accuracy and low latency on mobile devices, outperforming existing models in tasks like ImageNet classification and COCO object detection.

Authors:Shiyao Sun, Kapil Khandelwal
Title: Structural Symmetry, Multiplicity, and Differentiability of Eigenfrequencies
Abstract:
This work investigates the multiplicity and differentiability of eigenfrequencies in structures with various symmetries. In particular, the study explores how the geometric and design variable symmetries affect the distribution of eigenvalues, distinguishing between simple and multiple eigenvalues in 3-D trusses. Moreover, this article also examines the differentiability of multiple eigenvalues under various symmetry conditions, which is crucial for gradient-based optimization. The results presented in this study show that while full symmetry ensures the differentiability of all eigenvalues, increased symmetry in optimized design, such as accidental symmetry, may lead to non-differentiable eigenvalues. Additionally, the study presents solutions using symmetric functions, demonstrating their effectiveness in ensuring differentiability in scenarios where multiple eigenvalues are non-differentiable. The study also highlights a critical insight into the differentiability criterion of symmetric functions, i.e., the completeness of eigen-clusters, which is necessary to ensure the differentiability of such functions.
中文: 本研究探讨结构对称性如何影响特征频率的多样性和可微性,发现完全对称虽确保特征值可微,但过度对称可能导致不可微,而对称函数通过保持特征簇完整性可有效解决此问题。
English: This study examines how structural symmetries influence eigenfrequency multiplicity and differentiability, revealing that while full symmetry ensures differentiable eigenvalues, excessive symmetry may cause non-differentiability, with symmetric functions providing solutions by maintaining eigen-cluster completeness.

Authors:Zhikai Chen, Han Xie, Jian Zhang, Xiang song, Jiliang Tang, Huzefa Rangwala, George Karypis
Title: AutoG: Towards automatic graph construction from tabular data
Abstract:
Recent years have witnessed significant advancements in graph machine learning (GML), with its applications spanning numerous domains. However, the focus of GML has predominantly been on developing powerful models, often overlooking a crucial initial step: constructing suitable graphs from common data formats, such as tabular data. This construction process is fundamental to applying graph-based models, yet it remains largely understudied and lacks formalization. Our research aims to address this gap by formalizing the graph construction problem and proposing an effective solution. We identify two critical challenges to achieve this goal: 1. The absence of dedicated datasets to formalize and evaluate the effectiveness of graph construction methods, and 2. Existing automatic construction methods can only be applied to some specific cases, while tedious human engineering is required to generate high-quality graphs. To tackle these challenges, we present a two-fold contribution. First, we introduce a set of datasets to formalize and evaluate graph construction methods. Second, we propose an LLM-based solution, AutoG, which automatically generates high-quality graph schemas without human intervention. The experimental results demonstrate that the quality of constructed graphs is critical to downstream task performance, and AutoG can generate high-quality graphs that rival those produced by human experts. Our code is available at https://github.com/amazon-science/Automatic-Table-to-Graph-Generation.
Chinese: 近年来图机器学习虽发展迅速,却忽视了从表格数据构建图结构这一关键基础步骤;本研究通过提出AutoG这一基于大语言模型的解决方案,能够自动生成媲美人工设计的高质量图谱,有效填补了该领域的研究空白。
English: Recent graph machine learning advancements have overlooked the critical step of constructing graphs from tabular data, prompting this research to formalize the problem and introduce AutoG, an LLM-based solution that automatically generates high-quality graphs rivaling human expertise.

Authors:Hossein Mirzaei, Mohammad Jafari, Hamid Reza Dehbashi, Zeinab Sadat Taghavi, Mohammad Sabokrou, Mohammad Hossein Rohban
Title: Killing it with Zero-Shot: Adversarially Robust Novelty Detection
Abstract:
Novelty Detection (ND) plays a crucial role in machine learning by identifying new or unseen data during model inference. This capability is especially important for the safe and reliable operation of automated systems. Despite advances in this field, existing techniques often fail to maintain their performance when subject to adversarial attacks. Our research addresses this gap by marrying the merits of nearest-neighbor algorithms with robust features obtained from models pretrained on ImageNet. We focus on enhancing the robustness and performance of ND algorithms. Experimental results demonstrate that our approach significantly outperforms current state-of-the-art methods across various benchmarks, particularly under adversarial conditions. By incorporating robust pretrained features into the k-NN algorithm, we establish a new standard for performance and robustness in the field of robust ND. This work opens up new avenues for research aimed at fortifying machine learning systems against adversarial vulnerabilities. Our implementation is publicly available at https://github.com/rohban-lab/ZARND.
中文摘要:本研究通过将鲁棒的ImageNet预训练特征与k近邻算法相结合,显著提升了机器学习中新颖性检测在对抗攻击下的性能,为鲁棒性设立了新标准。
English Summary: Our research enhances novelty detection in machine learning by integrating robust ImageNet-pretrained features with k-NN algorithms, significantly outperforming existing methods under adversarial attacks and setting a new benchmark for robustness.
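
The nearest-neighbor scoring idea in the abstract above can be sketched as follows. This is a minimal illustration, assuming feature vectors have already been extracted by some pretrained encoder (the adversarially robust ImageNet backbone is not reproduced here); the function name `knn_novelty_score` and the choice of Euclidean distance are assumptions, not details confirmed by the source.

```python
import math


def knn_novelty_score(query, train_features, k=1):
    """Mean Euclidean distance from `query` to its k nearest training features.

    A larger score means the query lies farther from the normal training
    data and is therefore more likely to be novel.
    """
    distances = sorted(math.dist(query, feature) for feature in train_features)
    k = min(k, len(distances))
    return sum(distances[:k]) / k


if __name__ == "__main__":
    # Hypothetical 2-D "features" of normal training samples.
    normal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(knn_novelty_score((0.1, 0.1), normal, k=2))  # small: looks normal
    print(knn_novelty_score((3.0, 4.0), normal, k=2))  # large: looks novel
```

A threshold on this score, chosen on held-out normal data, would turn it into a detector; how the threshold is set is left open here.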

Authors:Pauline Bourigault, Danilo P. Mandic
Title: Kernel-Based Anomaly Detection Using Generalized Hyperbolic Processes
Abstract:
We present a novel approach to anomaly detection by integrating Generalized Hyperbolic (GH) processes into kernel-based methods. The GH distribution, known for its flexibility in modeling skewness, heavy tails, and kurtosis, helps to capture complex patterns in data that deviate from Gaussian assumptions. We propose a GH-based kernel function and utilize it within Kernel Density Estimation (KDE) and One-Class Support Vector Machines (OCSVM) to develop anomaly detection frameworks. Theoretical results confirmed the positive semi-definiteness and consistency of the GH-based kernel, ensuring its suitability for machine learning applications. Empirical evaluation on synthetic and real-world datasets showed that our method improves detection performance in scenarios involving heavy-tailed and asymmetric or imbalanced distributions. https://github.com/paulinebourigault/GHKernelAnomalyDetect
中文: 本研究提出了一种新颖的异常检测方法,通过将广义双曲过程融入基于核的技术,在处理重尾和不对称数据分布时显著提升了检测性能。
English: This study introduces a novel anomaly detection method that integrates Generalized Hyperbolic processes into kernel-based approaches, enhancing performance in handling heavy-tailed and asymmetric data distributions.

Authors:Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
Title: Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
Abstract:
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models, thereby minimizing hallucinations. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine-tuning, which can lead to misalignments between the objectives of individual modules and the overarching aim of generating accurate answers in question-answering (QA) tasks. Although recent efforts have explored reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on overly simplistic pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent. Specifically, we present MMOA-RAG, a Multi-Module joint Optimization Algorithm for RAG, which employs multi-agent reinforcement learning to harmonize all agents' goals towards a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA datasets demonstrate that MMOA-RAG improves the overall pipeline performance and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and the adaptability of MMOA-RAG across different RAG components and datasets. The code of MMOA-RAG is on https://github.com/chenyiqun/MMOA-RAG.
Chinese: 本文提出MMOA-RAG,通过多智能体强化学习将检索增强生成流程中的各个组件作为智能体进行联合优化,使所有模块目标统一于最终答案的F1分数等奖励指标,在多项问答任务中超越了现有基线方法。
English: The paper introduces MMOA-RAG, a multi-agent reinforcement learning approach that optimizes the entire retrieval-augmented generation pipeline by aligning all components toward a unified reward, improving performance on question-answering tasks over existing methods.
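
The abstract names the F1 score of the final answer as an example of the unified reward. A minimal sketch of a token-level F1 of that kind is shown below, assuming lowercase whitespace tokenization; the exact normalization used by MMOA-RAG is not stated in the abstract, and the function name `f1_reward` is hypothetical.

```python
from collections import Counter


def f1_reward(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # In a cooperative setup, every agent in the pipeline would receive
    # this same scalar as its reward.
    print(f1_reward("the cat sat", "the cat"))
```

Sharing one scalar across the query rewriter, filter, and generator is what ties each module's objective to final answer quality rather than to a module-specific proxy.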

Authors:Hao Tang, Siyue Yu, Jian Pang, Bingfeng Zhang
Title: A Training-free Synthetic Data Selection Method for Semantic Segmentation
Abstract:
Training semantic segmenters with synthetic data has been attracting great attention due to its easy accessibility and huge quantities. Most previous methods focused on producing large-scale synthetic image-annotation samples and then training the segmenter with all of them. However, such a solution faces a major challenge: poor-quality samples are unavoidable, and using them to train the model will damage the training process. In this paper, we propose a training-free Synthetic Data Selection (SDS) strategy with CLIP to select high-quality samples for building a reliable synthetic dataset. Specifically, given massive synthetic image-annotation pairs, we first design a Perturbation-based CLIP Similarity (PCS) to measure the reliability of synthetic images, thus removing samples with low-quality images. Then we propose a class-balance Annotation Similarity Filter (ASF) that compares the synthetic annotation with the response of CLIP to remove samples with low-quality annotations. The experimental results show that our method reduces the data size by half, while the trained segmenter achieves higher performance. The code is released at https://github.com/tanghao2000/SDS.
中文摘要:本文提出了一种基于CLIP的无训练合成数据选择方法,通过双重质量筛选机制构建可靠数据集,在将数据量减半的同时显著提升了分割模型的性能。
English Summary: The paper introduces a training-free Synthetic Data Selection (SDS) method using CLIP to filter high-quality synthetic data for semantic segmentation, which halves dataset size while improving model performance.

Authors:Aitor Sánchez-Ferrera, Borja Calvo, Jose A. Lozano
Title: A Review on Self-Supervised Learning for Time Series Anomaly Detection: Recent Advances and Open Challenges
Abstract:
Time series anomaly detection presents various challenges due to the sequential and dynamic nature of time-dependent data. Traditional unsupervised methods frequently encounter difficulties in generalization, often overfitting to known normal patterns observed during training and struggling to adapt to unseen normality. In response to this limitation, self-supervised techniques for time series have garnered attention as a potential solution to overcome this obstacle and enhance the performance of anomaly detectors. This paper presents a comprehensive review of the recent methods that make use of self-supervised learning for time series anomaly detection. A taxonomy is proposed to categorize these methods based on their primary characteristics, facilitating a clear understanding of their diversity within this field. The information contained in this survey, along with additional details that will be periodically updated, is available on the following GitHub repository: https://github.com/Aitorzan3/Awesome-Self-Supervised-Time-Series-Anomaly-Detection.
中文: 本文综述了自监督学习在时间序列异常检测中的应用,提出了分类法以系统归类现有方法,旨在克服传统无监督方法的泛化不足问题。
English: This paper reviews self-supervised learning methods for time series anomaly detection, proposing a taxonomy to categorize them and addressing the limitations of traditional unsupervised approaches.

Authors:Zhihao Yao, Jixuan Yin, Bo Li
Title: Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering
Abstract:
Short text clustering has gained significant attention in the data mining community. However, the limited valuable information contained in short texts often leads to low-discriminative representations, increasing the difficulty of clustering. This paper proposes a novel short text clustering framework, called Reliable Pseudo-labeling via Optimal Transport with Attention for Short Text Clustering (POTA), that generates reliable pseudo-labels to aid discriminative representation learning for clustering. Specifically, POTA first implements an instance-level attention mechanism to capture the semantic relationships among samples, which are then incorporated as a semantic consistency regularization term into an optimal transport problem. By solving this OT problem, we can yield reliable pseudo-labels that simultaneously account for sample-to-sample semantic consistency and sample-to-cluster global structure information. Additionally, the proposed OT can adaptively estimate cluster distributions, making POTA well-suited for varying degrees of imbalanced datasets. Then, we utilize the pseudo-labels to guide contrastive learning to generate discriminative representations and achieve efficient clustering. Extensive experiments demonstrate that POTA outperforms state-of-the-art methods. The code is available at: https://github.com/YZH0905/POTA-STC/tree/main.
中文: 本文提出POTA短文本聚类框架,通过结合注意力机制与最优传输生成可靠伪标签,提升表征区分度,在非平衡数据集上表现优异。
English: This paper introduces POTA, a novel short text clustering framework that uses optimal transport with attention to generate reliable pseudo-labels, enhancing discriminative representation learning and achieving superior performance on imbalanced datasets.

Authors:Youssef Zaazou, Alex Bihlo, Terrence S. Tricco
Title: Mapping Galaxy Images Across Ultraviolet, Visible and Infrared Bands Using Generative Deep Learning
Abstract:
We demonstrate that generative deep learning can translate galaxy observations across ultraviolet, visible, and infrared photometric bands. Leveraging mock observations from the Illustris simulations, we develop and validate a supervised image-to-image model capable of performing both band interpolation and extrapolation. The resulting trained models exhibit high fidelity in generating outputs, as verified by both general image comparison metrics (MAE, SSIM, PSNR) and specialized astronomical metrics (GINI coefficient, M20). Moreover, we show that our model can be used to predict real-world observations, using data from the DECaLS survey as a case study. These findings highlight the potential of generative learning to augment astronomical datasets, enabling efficient exploration of multi-band information in regions where observations are incomplete. This work opens new pathways for optimizing mission planning, guiding high-resolution follow-ups, and enhancing our understanding of galaxy morphology and evolution.
中文: 生成式深度学习能够跨紫外、可见光和红外波段转换星系观测数据,通过高保真度的插值与外推增强天文数据集,从而优化任务规划并深化对星系形态和演化的理解。
English: Generative deep learning effectively translates galaxy observations across multiple photometric bands, enabling high-fidelity interpolation and extrapolation to augment astronomical datasets and enhance exploration of galaxy morphology.

Authors:Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, Yuxin Peng
Title: Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
Abstract:
Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR, namely object information extraction, category knowledge reserve, and object-category alignment, and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives, naturally bringing representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
Chinese: 本研究提出了Finedefics多模态大语言模型,通过引入物体属性描述并采用包含困难负样本的对比学习,显著提升了细粒度视觉识别能力,在多个数据集上展现出优越性能。
English: The study introduces Finedefics, a multi-modal large language model that enhances fine-grained visual recognition by incorporating object attribute descriptions and using contrastive learning with hard negatives, achieving superior performance on multiple datasets.

Authors:Bowen Zheng, Ran Cheng, Kay Chen Tan
Title: EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning
Abstract:
Evolutionary Reinforcement Learning (EvoRL) has emerged as a promising approach to overcoming the limitations of traditional reinforcement learning (RL) by integrating the Evolutionary Computation (EC) paradigm with RL. However, the population-based nature of EC significantly increases computational costs, thereby restricting the exploration of algorithmic design choices and scalability in large-scale settings. To address this challenge, we introduce EvoRL, the first end-to-end EvoRL framework optimized for GPU acceleration. The framework executes the entire training pipeline on accelerators, including environment simulations and EC processes, leveraging hierarchical parallelism through vectorization and compilation techniques to achieve superior speed and scalability. This design enables the efficient training of large populations on a single machine. In addition to its performance-oriented design, EvoRL offers a comprehensive platform for EvoRL research, encompassing implementations of traditional RL algorithms (e.g., A2C, PPO, DDPG, TD3, SAC), Evolutionary Algorithms (e.g., CMA-ES, OpenES, ARS), and hybrid EvoRL paradigms such as Evolutionary-guided RL (e.g., ERL, CEM-RL) and Population-Based AutoRL (e.g., PBT). The framework's modular architecture and user-friendly interface allow researchers to seamlessly integrate new components, customize algorithms, and conduct fair benchmarking and ablation studies. The project is open-source and available at: https://github.com/EMI-Group/evorl.
中文: EvoRL是首个面向进化强化学习的端到端GPU加速框架,通过分层并行实现高效大规模训练,同时提供模块化平台支持算法集成与研究。
English: EvoRL is the first GPU-accelerated end-to-end framework for Evolutionary Reinforcement Learning, enabling efficient large-scale training through hierarchical parallelism while providing a modular platform for algorithm integration and research.

Authors:Ziqi Liu
Title: FreqMoE: Enhancing Time Series Forecasting through Frequency Decomposition Mixture of Experts
Abstract:
Long-term time series forecasting is essential in areas like finance and weather prediction. Besides traditional methods that operate in the time domain, many recent models transform time series data into the frequency domain to better capture complex patterns. However, these methods often use filtering techniques to remove certain frequency signals as noise, which may unintentionally discard important information and reduce prediction accuracy. To address this, we propose the Frequency Decomposition Mixture-of-Experts (FreqMoE) model, which dynamically decomposes time series data into frequency bands, each processed by a specialized expert. A gating mechanism adjusts the importance of each expert's output based on frequency characteristics, and the aggregated results are fed into a prediction module that iteratively refines the forecast using residual connections. Our experiments demonstrate that FreqMoE outperforms state-of-the-art models, achieving the best performance on 51 out of 70 metrics across all tested datasets, while significantly reducing the number of required parameters to under 50k, providing notable efficiency advantages. Code is available at: https://github.com/sunbus100/FreqMoE-main
中文: FreqMoE模型通过动态分解时序数据至频段并由专家处理,在多数指标上超越现有最优模型,同时显著提升了预测精度与参数效率。
English: The FreqMoE model dynamically decomposes time series into frequency bands processed by specialized experts, outperforming state-of-the-art methods with superior accuracy and efficiency across multiple datasets.
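
The band decomposition and gated aggregation described in the abstract above can be illustrated with a small sketch. It makes several assumptions that are not from the source: band boundaries are fixed rather than learned, each expert is the identity, the gate weights are passed in by hand, and a naive O(n^2) DFT stands in for an FFT. All function names are hypothetical.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_into_bands(x, edges):
    """Split x into time-domain components, one per frequency band.

    edges: increasing frequency indices covering [0, len(x) // 2 + 1).
    Bin k and its mirror n - k are kept together so each band stays real.
    """
    n = len(x)
    spectrum = dft(x)
    bands = []
    for low, high in zip(edges[:-1], edges[1:]):
        masked = [c if low <= min(k, n - k) < high else 0j
                  for k, c in enumerate(spectrum)]
        bands.append(idft_real(masked))
    return bands


def gated_sum(bands, gates):
    """Weight each band (expert output) by its gate and sum over bands."""
    return [sum(g * band[t] for g, band in zip(gates, bands))
            for t in range(len(bands[0]))]


if __name__ == "__main__":
    series = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
    bands = split_into_bands(series, [0, 1, 3, 5])
    print(gated_sum(bands, [1.0, 1.0, 1.0]))  # reconstructs the series
```

Because the bands partition the spectrum, gates of all ones reproduce the input exactly, so no frequency content is discarded by construction; down-weighting a band is then a soft alternative to filtering it out.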

Authors:Qingtian Bian, Marcus Vinícius de Carvalho, Tieying Li, Jiaxing Xu, Hui Fang, Yiping Ke
Title: ABXI: Invariant Interest Adaptation for Task-Guided Cross-Domain Sequential Recommendation
Abstract:
Cross-Domain Sequential Recommendation (CDSR) has recently gained attention for countering data sparsity by transferring knowledge across domains. A common approach merges domain-specific sequences into cross-domain sequences, serving as bridges to connect domains. One key challenge is to correctly extract the shared knowledge among these sequences and appropriately transfer it. Most existing works directly transfer unfiltered cross-domain knowledge rather than extracting domain-invariant components and adaptively integrating them into domain-specific modeling. Another challenge lies in aligning the domain-specific and cross-domain sequences. Existing methods align these sequences based on timestamps, but this approach can cause prediction mismatches when the current tokens and their targets belong to different domains. In such cases, the domain-specific knowledge carried by the current tokens may degrade performance. To address these challenges, we propose the A-B-Cross-to-Invariant Learning Recommender (ABXI). Specifically, leveraging LoRA's effectiveness for efficient adaptation, ABXI incorporates two types of LoRAs to facilitate knowledge adaptation. First, all sequences are processed through a shared encoder that employs a domain LoRA for each sequence, thereby preserving unique domain characteristics. Next, we introduce an invariant projector that extracts domain-invariant interests from cross-domain representations, utilizing an invariant LoRA to adapt these interests into modeling each specific domain. In addition, to avoid prediction mismatches, all domain-specific sequences are aligned to match the domains of the cross-domain ground truths. Experimental results on three datasets demonstrate that our approach outperforms other CDSR counterparts by a large margin. The code is available at https://github.com/DiMarzioBian/ABXI.
中文: ABXI模型通过领域特定和不变LoRA模块提取并适配共享知识,同时对齐序列以避免预测失配,从而在跨域序列推荐中显著优于现有方法,并在多个数据集上验证了其优越性能。
English: The ABXI model addresses challenges in Cross-Domain Sequential Recommendation by using domain-specific and invariant LoRA modules to extract and adapt shared knowledge while aligning sequences to prevent prediction mismatches, achieving superior performance on multiple datasets.
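As a rough illustration of the "shared encoder with one LoRA per sequence type" idea, the sketch below keeps a single frozen weight matrix and adds a separate low-rank update per adapter. It is a generic LoRA layer in plain Python, not the ABXI implementation; the adapter names (`domain_a`, `domain_b`, `cross`), the dimensions, and the initialization scale are all assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """One frozen shared weight W plus a low-rank update B @ A per adapter."""

    def __init__(self, d_in, d_out, rank, adapters, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)  # shared across all adapters, kept frozen
        # Usual LoRA initialization: A is random, B is zero, so every adapter
        # starts out as a no-op on top of the shared weight.
        self.A = {name: rand(rank, d_in) for name in adapters}
        self.B = {name: [[0.0] * rank for _ in range(d_out)] for name in adapters}

    def forward(self, x, adapter):
        base = matvec(self.W, x)
        delta = matvec(self.B[adapter], matvec(self.A[adapter], x))
        return [b + d for b, d in zip(base, delta)]


# Hypothetical adapter names: one per domain plus one for cross-domain sequences.
layer = LoRALinear(d_in=4, d_out=4, rank=2, adapters=["domain_a", "domain_b", "cross"])
x = [1.0, 1.0, 1.0, 1.0]
before_a = layer.forward(x, "domain_a")
before_b = layer.forward(x, "domain_b")

# Simulate training touching only domain_a's low-rank factors.
layer.B["domain_a"][0][0] = 1.0
after_a = layer.forward(x, "domain_a")
after_b = layer.forward(x, "domain_b")
```

Only the small `A` and `B` factors would be trained per domain, so the domains share most parameters while keeping separate low-rank corrections.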

Authors:Zihang Li, Yangdong Ruan, Wenjun Liu, Zhengyang Wang, Tong Yang
Title: CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter
Abstract:
Although retrieval-augmented generation (RAG) significantly improves generation quality by retrieving external knowledge bases and integrating generated content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks involving hierarchical structures for Tree-RAG. This paper proposes a Tree-RAG acceleration method based on an improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG organizes entities through a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. Experimental results demonstrate that our method is much faster than naive Tree-RAG while maintaining high generation quality. When the number of trees is large, our method is hundreds of times faster than naive Tree-RAG. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.
中文: 本文提出了一种基于改进布谷鸟过滤器的Tree-RAG加速方法,通过优化实体定位显著提升检索效率,在保持生成质量的同时,处理大规模树结构时速度可提升数百倍。
English: This paper introduces an acceleration method for Tree-RAG using an enhanced Cuckoo Filter, which significantly boosts retrieval efficiency by optimizing entity localization while preserving generation quality, achieving up to hundreds of times faster performance with large tree structures.
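For readers unfamiliar with the data structure, here is a minimal sketch of a plain cuckoo filter with partial-key cuckoo hashing, supporting the membership queries and dynamic updates mentioned in the abstract. It does not reproduce the paper's improved variant or its integration with entity trees; the bucket count, 16-bit fingerprints, SHA-256 hashing, and the example entity names are assumptions chosen for illustration.

```python
import hashlib
import random


class CuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # num_buckets must be a power of two so that the XOR in _alt_index
        # maps each of an item's two candidate buckets to the other.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.m = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        # 16-bit fingerprint in 1..65535; only this is stored, not the item.
        return self._hash(b"fp:" + item.encode()) % 65535 + 1

    def _index(self, item):
        return self._hash(item.encode()) % self.m

    def _alt_index(self, index, fp):
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.m

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is treated as full; the last evicted entry is dropped

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


# Hypothetical entity names standing in for nodes of an entity tree.
entity_filter = CuckooFilter()
for entity in ("aspirin", "ibuprofen", "paracetamol"):
    entity_filter.insert(entity)
found_before = entity_filter.contains("aspirin")
entity_filter.delete("aspirin")
found_after = entity_filter.contains("aspirin")
```

A lookup touches at most two buckets regardless of how many entities are stored, which is the property that makes such filters attractive for fast entity localization. Like any filter of this kind, it can return false positives but not false negatives for items that were inserted and not deleted.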

Authors:Jiayi Liao, Ruobing Xie, Sihang Li, Xiang Wang, Xingwu Sun, Zhanhui Kang, Xiangnan He
Title: Multi-Grained Patch Training for Efficient LLM-based Recommendation
Abstract:
Large Language Models (LLMs) have emerged as a new paradigm for recommendation by converting interacted item history into language modeling. However, constrained by the limited context length of LLMs, existing approaches have to truncate item history in the prompt, focusing only on recent interactions and sacrificing the ability to model long-term history. To enable LLMs to model long histories, we pursue a concise embedding representation for items and sessions. In the LLM embedding space, we construct an item's embedding by aggregating its textual token embeddings; similarly, we construct a session's embedding by aggregating its item embeddings. While efficient, this way poses two challenges since it ignores the temporal significance of user interactions and LLMs do not natively interpret our custom embeddings. To overcome these, we propose PatchRec, a multi-grained patch training method consisting of two stages: (1) Patch Pre-training, which familiarizes LLMs with aggregated embeddings -- patches, and (2) Patch Fine-tuning, which enables LLMs to capture time-aware significance in interaction history. Extensive experiments show that PatchRec effectively models longer behavior histories with improved efficiency. This work facilitates the practical use of LLMs for modeling long behavior histories. Codes are available at https://github.com/ljy0ustc/PatchRec.
中文摘要:PatchRec提出了一种多粒度补丁训练方法,通过为项目和会话创建简洁的嵌入表示,使大语言模型能够有效建模长用户行为历史,并通过预训练和微调两阶段解决时间重要性和自定义嵌入解释的挑战。
English Summary: PatchRec introduces a multi-grained patch training method to enable LLMs to effectively model long user behavior histories by creating concise embeddings for items and sessions, overcoming limitations of temporal significance and custom embedding interpretation through pre-training and fine-tuning stages.
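The two-level aggregation described in the abstract (token embeddings into an item embedding, item embeddings into a session embedding) can be sketched as follows. The abstract only says "aggregating", so mean pooling is an assumption here, and the item names and 2-dimensional embedding values are made up for the example.

```python
def mean_pool(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical token embeddings for the text of two items.
item_token_embeddings = {
    "item_1": [[1.0, 0.0], [3.0, 2.0]],
    "item_2": [[0.0, 4.0]],
}

# Item-level patch: one vector per item, pooled over its text tokens.
item_embeddings = {
    name: mean_pool(tokens) for name, tokens in item_token_embeddings.items()
}

# Session-level patch: one vector per session, pooled over its items.
session_embedding = mean_pool([item_embeddings["item_1"], item_embeddings["item_2"]])
```

Each item or session then occupies a single position in the prompt instead of many tokens, which is what allows a longer history to fit in a fixed context length.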

Authors:Ryo Takizawa, Yoshiyuki Ohmura, Yasuo Kuniyoshi
Title: Gaze-Guided Task Decomposition for Imitation Learning in Robotic Manipulation
Abstract:
In imitation learning for robotic manipulation, decomposing object manipulation tasks into sub-tasks enables the reuse of learned skills and the combination of learned behaviors to perform novel tasks, rather than simply replicating demonstrated motions. Human gaze is closely linked to hand movements during object manipulation. We hypothesize that an imitating agent's gaze control, fixating on specific landmarks and transitioning between them, simultaneously segments demonstrated manipulations into sub-tasks. This study proposes a simple yet robust task decomposition method based on gaze transitions. We use teleoperation, a common modality for collecting demonstrations in robotic manipulation, in which a human operator's gaze is measured and used for task decomposition as a substitute for an imitating agent's gaze. Our approach ensures consistent task decomposition across all demonstrations for each task, which is desirable in contexts such as machine learning. We evaluated the method across demonstrations of various tasks, assessing the characteristics and consistency of the resulting sub-tasks. Furthermore, extensive testing across different hyperparameter settings confirmed its robustness, making it adaptable to diverse robotic systems. Our code is available at https://github.com/crumbyRobotics/GazeTaskDecomp.
中文摘要:本研究提出一种基于视线转移的简单而鲁棒的任务分解方法,通过测量人类操作者在遥操作演示中的注视点变化,将操作任务分解为一致性高的子任务,并在多种任务和参数设置中验证了其有效性。
English Summary: This study introduces a gaze-based method for decomposing robotic manipulation tasks into consistent sub-tasks using human operator gaze transitions during teleoperation, demonstrating robustness across various tasks and parameter settings.

Authors:Mengshi Qi, Xiaoyang Bi, Pengfei Zhu, Huadong Ma
Title: Towards Robust Unsupervised Attention Prediction in Autonomous Driving
Abstract:
Robustly predicting attention regions of interest for self-driving systems is crucial for driving safety but presents significant challenges due to the labor-intensive nature of obtaining large-scale attention labels and the domain gap between self-driving scenarios and natural scenes. These challenges are further exacerbated by complex traffic environments, including camera corruption under adverse weather, noise interferences, and central bias from long-tail distributions. To address these issues, we propose a robust unsupervised attention prediction method. An Uncertainty Mining Branch refines predictions by analyzing commonalities and differences across multiple pre-trained models on natural scenes, while a Knowledge Embedding Block bridges the domain gap by incorporating driving knowledge to adaptively enhance pseudo-labels. Additionally, we introduce RoboMixup, a novel data augmentation method that improves robustness against corruption through soft attention and dynamic augmentation, and mitigates central bias by integrating random cropping into Mixup as a regularizer. To systematically evaluate robustness in self-driving attention prediction, we introduce the DriverAttention-C benchmark, comprising over 100k frames across three subsets: BDD-A-C, DR(eye)VE-C, and DADA-2000-C. Our method achieves performance equivalent to or surpassing fully supervised state-of-the-art approaches on three public datasets and the proposed robustness benchmark, reducing relative corruption degradation by 58.8% and 52.8%, and improving central bias robustness by 12.4% and 11.4% in KLD and CC metrics, respectively. Code and data are available at https://github.com/zaplm/DriverAttention.
Chinese: 本研究提出了一种鲁棒的无监督自动驾驶注意力预测方法,通过不确定性挖掘和知识嵌入弥合领域差异,并采用RoboMixup数据增强提升抗干扰能力和缓解中心偏差,在新型DriverAttention-C基准测试中表现优于全监督方法。
English: This study introduces a robust unsupervised method for predicting attention in self-driving systems, utilizing uncertainty mining and knowledge embedding to bridge domain gaps and employing RoboMixup data augmentation to enhance corruption resistance and reduce central bias, validated by the new DriverAttention-C benchmark showing superior performance over supervised approaches.

Authors:Zhongpu Chen, Yinfeng Liu, Long Shi, Xingyan Chen, Yu Zhao, Fuji Ren
Title: MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models
Abstract:
Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate the readability from the view of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.
中文: 本文提出了MDEval基准,用于评估大语言模型的Markdown感知能力,显著提升了可读性和结构评估,实现了与人类判断的高度相关性,并通过微调使开源模型在Markdown感知方面达到与GPT-4o相当的性能。
English: This paper introduces MDEval, a benchmark designed to evaluate the Markdown Awareness of large language models, which significantly improves readability and structure assessment, achieving high correlation with human judgment and enabling open-source models to match GPT-4o's performance through fine-tuning.
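Agreement with human judgment is reported above as a Spearman correlation. As a reminder of what that statistic measures, the sketch below computes Spearman's rho as the Pearson correlation of ranks, using average ranks for ties. The score lists are invented for illustration and are not data from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical automatic scores and human ratings for four responses.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_ratings = [1, 3, 4, 2]
rho_same_order = spearman(metric_scores, human_ratings)
rho_reversed = spearman(metric_scores, [4, 2, 1, 3])
```

Because only the ordering matters, a metric can reach a high Spearman correlation with human ratings even when the two use different scales.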

Authors:Kaixun Jiang, Zhaoyu Chen, Jiyuan Fu, Lingyi Hong, Jinglun Li, Wenqiang Zhang
Title: VideoPure: Diffusion-based Adversarial Purification for Video Recognition
Abstract:
Recent work indicates that video recognition models are vulnerable to adversarial examples, posing a serious security risk to downstream applications. However, current research has primarily focused on adversarial attacks, with limited work exploring defense mechanisms. Furthermore, due to the spatial-temporal complexity of videos, existing video defense methods face issues of high cost, overfitting, and limited defense performance. Recently, diffusion-based adversarial purification methods have achieved robust defense performance in the image domain. However, due to the additional temporal dimension in videos, directly applying these diffusion-based adversarial purification methods to the video domain suffers performance and efficiency degradation. To achieve an efficient and effective video adversarial defense method, we propose the first diffusion-based video purification framework to improve video recognition models' adversarial robustness: VideoPure. Given an adversarial example, we first employ temporal DDIM inversion to transform the input distribution into a temporally consistent and trajectory-defined distribution, covering adversarial noise while preserving more video structure. Then, during DDIM denoising, we leverage intermediate results at each denoising step and conduct guided spatial-temporal optimization, removing adversarial noise while maintaining temporal consistency. Finally, we input the list of optimized intermediate results into the video recognition model for multi-step voting to obtain the predicted class. We investigate the defense performance of our method against black-box, gray-box, and adaptive attacks on benchmark datasets and models. Compared with other adversarial purification methods, our method overall demonstrates better defense performance against different attacks. Our code is available at https://github.com/deep-kaixun/VideoPure.
中文摘要:本研究提出了VideoPure,首个基于扩散的视频净化框架,通过时序DDIM反演和引导式时空优化,在保持时序一致性的同时有效去除对抗噪声,显著提升了视频识别模型对抗不同攻击的防御性能。
English Summary: The study introduces VideoPure, a novel diffusion-based video purification framework designed to enhance adversarial robustness in video recognition models by employing temporal DDIM inversion and guided spatial-temporal optimization for effective noise removal while maintaining temporal consistency.

Authors:Bao Duong, Sunil Gupta, Thin Nguyen
Title: Causal Discovery via Bayesian Optimization
Abstract:
Existing score-based methods for directed acyclic graph (DAG) learning from observational data struggle to recover the causal graph accurately and sample-efficiently. To overcome this, we propose DrBO (DAG recovery via Bayesian Optimization), a novel DAG learning framework that leverages Bayesian optimization (BO) to find high-scoring DAGs. We show that, by carefully choosing which promising DAGs to explore, we can find higher-scoring ones much more efficiently. To address the scalability issues of conventional BO in DAG learning, we replace the Gaussian Processes commonly employed in BO with dropout neural networks, trained in a continual manner, which allows for (i) flexibly modeling the DAG scores without overfitting, (ii) incorporating uncertainty into the estimated scores, and (iii) scaling with the number of evaluations. As a result, DrBO is computationally efficient and can find an accurate DAG in fewer trials and less time than existing state-of-the-art methods. This is demonstrated through an extensive set of empirical evaluations on many challenging settings with both synthetic and real data. Our implementation is available at https://github.com/baosws/DrBO.
中文: 本研究提出DrBO,一种利用贝叶斯优化和dropout神经网络的新型有向无环图学习框架,能高效准确地从观测数据中恢复因果图,在速度和精度上均优于现有方法。
English: This study introduces DrBO, a novel DAG learning framework that uses Bayesian optimization with dropout neural networks to efficiently and accurately recover causal graphs from observational data, outperforming existing methods in both speed and accuracy.

Authors:Hongbo Zheng, Suyuan Wang, Neeraj Gangwar, Nickvash Kani
Title: E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions
Abstract:
Vector representations have been pivotal in advancing natural language processing (NLP), with prior research focusing on embedding techniques for mathematical expressions using mathematically equivalent formulations. While effective, these approaches are constrained by the size and diversity of training data. In this work, we address these limitations by introducing E-Gen, a novel e-graph-based dataset generation scheme that synthesizes large and diverse mathematical expression datasets, surpassing prior methods in size and operator variety. Leveraging this dataset, we train embedding models using two strategies: (1) generating mathematically equivalent expressions, and (2) contrastive learning to explicitly group equivalent expressions. We evaluate these embeddings on both in-distribution and out-of-distribution mathematical language processing tasks, comparing them against prior methods. Finally, we demonstrate that our embedding-based approach outperforms state-of-the-art large language models (LLMs) on several tasks, underscoring the necessity of optimizing embedding methods for the mathematical data modality. The source code and datasets are available at https://github.com/MLPgroup/E-Gen.
中文: 本研究提出E-Gen,一种基于e-图的数据生成方法,能创建大规模多样化的数学表达式数据集,用于训练嵌入模型,在多项数学处理任务中超越了现有技术和大型语言模型的性能。
English: This study introduces E-Gen, an e-graph-based method that generates large, diverse mathematical expression datasets to train embedding models, which outperform existing techniques and large language models on various mathematical processing tasks.

Authors:Qing Wang, Wen-jie Chen, Bo Li, Jing Su, Guangyu Wang, Qianqian Song
Title: HECLIP: Histology-Enhanced Contrastive Learning for Imputation of Transcriptomics Profiles
Abstract:
Histopathology, particularly hematoxylin and eosin (H&E) staining, plays a critical role in diagnosing and characterizing pathological conditions by highlighting tissue morphology. However, H&E-stained images inherently lack molecular information, requiring costly and resource-intensive methods like spatial transcriptomics to map gene expression with spatial resolution. To address these challenges, we introduce HECLIP (Histology-Enhanced Contrastive Learning for Imputation of Profiles), an innovative deep learning framework that bridges the gap between histological imaging and molecular profiling. HECLIP is specifically designed to infer gene expression profiles directly from H&E-stained images, eliminating the need for expensive spatial transcriptomics assays. HECLIP leverages an advanced image-centric contrastive loss function to optimize image representation learning, ensuring that critical morphological patterns in histology images are effectively captured and translated into accurate gene expression profiles. This design enhances the predictive power of the image modality while minimizing reliance on gene expression data. Through extensive benchmarking on publicly available datasets, HECLIP demonstrates superior performance compared to existing approaches, delivering robust and biologically meaningful predictions. Detailed ablation studies further underscore its effectiveness in extracting molecular insights from histology images. Additionally, HECLIP's scalable and cost-efficient approach positions it as a transformative tool for both research and clinical applications, driving advancements in precision medicine. The source code for HECLIP is openly available at https://github.com/QSong-github/HECLIP.
中文: HECLIP是一种创新的深度学习框架,可直接从H&E染色组织学图像中精确推断基因表达谱,无需昂贵的空间转录组学检测,并通过广泛基准测试展现出卓越性能。
English: HECLIP is a novel deep learning framework that accurately infers gene expression profiles directly from H&E-stained histology images, eliminating the need for costly spatial transcriptomics while demonstrating superior performance through extensive benchmarking.

Authors:Md. Kamrul Hasan, Guang Yang, Choon Hwai Yap
Title: Motion-enhanced Cardiac Anatomy Segmentation via an Insertable Temporal Attention Module
Abstract:
Cardiac anatomy segmentation is useful for clinical assessment of cardiac morphology to inform diagnosis and intervention. Deep learning (DL), especially with motion information, has improved segmentation accuracy. However, existing techniques for motion enhancement are not yet optimal, and they have high computational costs due to increased dimensionality or reduced robustness due to suboptimal approaches that use non-DL motion registration, non-attention models, or single-headed attention. They further have limited adaptability and are inconvenient for incorporation into existing networks where motion awareness is desired. Here, we propose a novel, computationally efficient Temporal Attention Module (TAM) that offers robust motion enhancement, modeled as a small, multi-headed, cross-temporal attention module. TAM's uniqueness is that it is a lightweight, plug-and-play module that can be inserted into a broad range of segmentation networks (CNN-based, Transformer-based, or hybrid) for motion enhancement without requiring substantial changes in the network's backbone. This feature enables high adaptability and ease of integration for enhancing both existing and future networks. Extensive experiments on multiple 2D and 3D cardiac ultrasound and MRI datasets confirm that TAM consistently improves segmentation across a range of networks while maintaining computational efficiency and improving on currently reported performance. The evidence demonstrates that it is a robust, generalizable solution for motion-awareness enhancement that is scalable (such as from 2D to 3D).
中文: 该摘要提出了一种新型时序注意力模块(TAM),能高效增强多种网络架构的运动感知心脏分割性能,大量实验证实其具有更优的运算效率和泛化能力。
English: This abstract introduces a novel Temporal Attention Module (TAM) that efficiently enhances motion-aware cardiac segmentation across various network architectures, demonstrating improved performance and computational efficiency in extensive experiments.

Authors:Taewoong Lee, Sarah Frisken, Nazim Haouchine
Title: 3D/2D Registration of Angiograms using Silhouette-based Differentiable Rendering
Abstract:
We present a method for 3D/2D registration of Digital Subtraction Angiography (DSA) images to provide valuable insight into brain hemodynamics and angioarchitecture. Our approach formulates the registration as a pose estimation problem, leveraging both anteroposterior and lateral DSA views and employing differentiable rendering. Preliminary experiments on real and synthetic datasets demonstrate the effectiveness of our method, with both qualitative and quantitative evaluations highlighting its potential for clinical applications. The code is available at https://github.com/taewoonglee17/TwoViewsDSAReg.
中文: 本研究提出了一种利用双视角输入和可微分渲染的3D/2D DSA图像配准方法,实验结果表明其在临床应用中的潜力。
English: This study introduces a 3D/2D registration method for DSA images using pose estimation with dual-view inputs and differentiable rendering, showing promising results in experiments for clinical use.

Authors:Juan Ramirez, Ignacio Hounie, Juan Elenter, Jose Gallego-Posada, Meraj Hashemizadeh, Alejandro Ribeiro, Simon Lacoste-Julien
Title: Feasible Learning
Abstract:
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance on every individual data point. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
Chinese: 可行学习(FL)是一种以样本为中心的学习范式,通过限制每个数据点的损失来训练模型,相较于经验风险最小化,它在保持平均性能的同时显著改善了尾部表现。
English: Feasible Learning (FL) is a sample-centric paradigm that trains models by bounding the loss for each individual data point, leading to improved tail performance with minimal impact on average results compared to Empirical Risk Minimization.
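The primal-dual idea can be illustrated on a toy problem: find a slope `w` such that every sample's squared error is at most a threshold `eps`, using gradient descent on the model and projected gradient ascent on one dual variable per sample. This is a generic descent-ascent sketch under assumed settings (one-parameter linear model, squared loss, the step sizes and data below), not the authors' algorithm or their slack-variable relaxation.

```python
def feasible_fit(xs, ys, eps=0.01, lr_primal=0.005, lr_dual=0.1, steps=200):
    """Fit y ~ w * x so that each per-sample squared error is at most eps."""
    w = 0.0
    lam = [0.0] * len(xs)  # one dual variable (sample weight) per training point
    for _ in range(steps):
        errs = [w * x - y for x, y in zip(xs, ys)]
        losses = [e * e for e in errs]
        # Primal step: gradient descent on the lambda-weighted sum of losses.
        grad = sum(l * 2.0 * e * x for l, e, x in zip(lam, errs, xs))
        w -= lr_primal * grad
        # Dual step: projected gradient ascent. A sample's weight grows while
        # its loss exceeds eps and shrinks (never below zero) once it is met.
        lam = [max(0.0, l + lr_dual * (loss - eps)) for l, loss in zip(lam, losses)]
    return w, lam


# Toy data generated by y = 2x, so a feasible solution exists.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w, lam = feasible_fit(xs, ys)
final_losses = [(w * x - y) ** 2 for x, y in zip(xs, ys)]
```

In this run the sample with the largest initial loss (`x = 3.0`) accumulates the largest dual variable, which is the dynamic re-weighting behavior the abstract describes: harder samples receive more weight, instead of every sample contributing equally as in an average-loss objective.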

Authors:Michael K. Chen, Xikun Zhang, Dacheng Tao
Title: JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models
Abstract:
Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic.
中文摘要:JustLogic基准通过提供更高的复杂性、独立于先验知识以及深入错误分析能力,解决了现有大语言模型演绎推理评估的不足,实验表明推理型大语言模型仅达到人类平均水平,远未达到人类最佳水平。
English Summary: The JustLogic benchmark addresses limitations in existing deductive reasoning evaluations for LLMs by offering enhanced complexity, independence from prior knowledge, and detailed error analysis capabilities, revealing that while reasoning LLMs match average human performance, they fall short of peak human ability.

Authors:Libo Wang
Title: Wormhole Memory: A Rubik's Cube for Cross-Dialogue Retrieval
Abstract:
To address the inability of current large language models (LLMs) to share memory across dialogues, this research proposes a wormhole memory module (WMM) that treats memory as a Rubik's cube whose contents can be retrieved arbitrarily between different dialogues. The researcher built a simulation framework in Python and used memory barriers to simulate the current situation in which memories are difficult to share between LLM dialogues. Using the CoQA development set, the experiments verified the feasibility of cross-dialogue memory retrieval through WMM's nonlinear indexing and dynamic retrieval, and compared WMM with the Titans and MemGPT memory modules. Across eight experiments, WMM demonstrated the ability to retrieve memory across dialogues, with stable quantitative indicators. The work contributes a new technical approach to optimizing memory management in LLMs and provides experience for future practical applications.
Chinese: 本研究提出了一种虫洞记忆模块(WMM),实现了大语言模型跨对话记忆检索功能,实验验证了其可行性和稳定性,为优化记忆管理提供了新的技术途径。
English: This research introduces a wormhole memory module (WMM) that enables cross-dialogue memory retrieval in large language models, demonstrating its feasibility and stability through experiments and offering new technical approaches for memory management optimization.

Authors:Xin Zhou, Dingkang Liang, Sifan Tu, Xiwu Chen, Yikang Ding, Dingyuan Zhang, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Title: HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Abstract:
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be publicly released at https://github.com/LMD0311/HERMES.
Chinese: HERMES提出了一种统一的驾驶世界模型,通过鸟瞰图表示和世界查询将3D场景理解与未来场景生成相结合,在生成误差降低32.4%和理解指标提升8.0%的基础上实现了最先进的性能。
English: HERMES introduces a unified Driving World Model that integrates 3D scene understanding and future scene generation using Bird's-Eye View representation and world queries, achieving state-of-the-art performance with a 32.4% reduction in generation error and 8.0% improvement in understanding metrics.

Authors:Naihao Deng, Rada Mihalcea
Title: Rethinking Table Instruction Tuning
Abstract:
Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.
Chinese: 当前表格理解研究忽视了超参数选择和全面评估,导致领域外理解和通用能力显著下降;我们提出的TAMA模型通过优化训练参数,在保持通用能力的同时实现优异性能,甚至超越GPT系列模型,为降低标注成本、提升模型效率提供了新路径。
English: Recent advances in table understanding through instruction-tuned LLMs overlook hyperparameter impacts and comprehensive evaluation, revealing performance declines in out-of-domain and general capabilities, which our method TAMA addresses with optimized training to match or exceed leading models while preserving versatility.

Authors:Rongzhao He, Weihao Zheng, Leilei Zhao, Ying Wang, Dalin Zhu, Dan Wu, Bin Hu
Title: Surface Vision Mamba: Leveraging Bidirectional State Space Model for Efficient Spherical Manifold Representation
Abstract:
Attention-based methods have demonstrated exceptional performance in modelling long-range dependencies on spherical cortical surfaces, surpassing traditional Geometric Deep Learning (GDL) models. However, their extensive inference time and high memory demands pose challenges for application to large datasets with limited computing resources. Inspired by the state space model in computer vision, we introduce the attention-free Vision Mamba (Vim) to spherical surfaces, presenting a domain-agnostic architecture for analyzing data on spherical manifolds. Our method achieves surface patching by representing spherical data as a sequence of triangular patches derived from a subdivided icosphere. The proposed Surface Vision Mamba (SiM) is evaluated on multiple neurodevelopmental phenotype regression tasks using cortical surface metrics from neonatal brains. Experimental results demonstrate that SiM outperforms both attention- and GDL-based methods, delivering 4.8 times faster inference and achieving 91.7% lower memory consumption compared to the Surface Vision Transformer (SiT) under the Ico-4 grid partitioning. Sensitivity analysis further underscores the potential of SiM to identify subtle cognitive developmental patterns. The code is available at https://github.com/Rongzhao-He/surface-vision-mamba.
Chinese: 提出的表面视觉Mamba(SiM)方法在神经发育分析中超越了基于注意力和几何深度学习的方法,实现了推理速度提升4.8倍、内存消耗降低91.7%,并能有效识别细微的认知发育模式。
English: The proposed Surface Vision Mamba (SiM) method outperforms attention-based and geometric deep learning approaches in neurodevelopmental analysis by achieving 4.8x faster inference and 91.7% lower memory consumption while effectively identifying subtle cognitive patterns.
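The size of a subdivided icosphere follows directly from its construction: an icosahedron has 20 triangular faces, and each subdivision splits every triangle into four. The sketch below computes the resulting counts. Reading "Ico-4 grid partitioning" as one patch per face of a level-4 icosphere is an assumption, since the abstract does not state how many patches that setting yields.

```python
def icosphere_counts(level):
    """Vertex, edge, and face counts of an icosahedron subdivided `level` times."""
    faces = 20 * 4 ** level
    edges = 30 * 4 ** level
    vertices = 10 * 4 ** level + 2
    return vertices, edges, faces


# Level 0 is the plain icosahedron; level 4 is the grid assumed for "Ico-4".
base_vertices, base_edges, base_faces = icosphere_counts(0)
ico4_vertices, ico4_edges, ico4_faces = icosphere_counts(4)
```

Under that assumption the patch sequence would contain 5120 triangular patches, which gives a sense of the sequence lengths at which per-token memory and inference cost start to matter.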

Authors:Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen
Title: MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
Abstract:
Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents thereby surpassing their traditional role as chatbots. These agents can leverage their planning and tool utilization capabilities to address tasks specified at a high level. However, a standardized dataset to benchmark the agent capabilities of LLMs in medical applications is currently lacking, making the evaluation of LLMs on complex tasks in interactive healthcare environments challenging. To address this gap, we introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts. MedAgentBench encompasses 300 patient-specific clinically-derived tasks from 10 categories written by human physicians, realistic profiles of 100 patients with over 700,000 data elements, a FHIR-compliant interactive environment, and an accompanying codebase. The environment uses the standard APIs and communication infrastructure used in modern EMR systems, so it can be easily migrated into live EMR systems. MedAgentBench presents an unsaturated agent-oriented benchmark that current state-of-the-art LLMs exhibit some ability to succeed at. The best model (Claude 3.5 Sonnet v2) achieves a success rate of 69.67%. However, there is still substantial space for improvement which gives the community a next direction to optimize. Furthermore, there is significant variation in performance across task categories. MedAgentBench establishes this and is publicly available at https://github.com/stanfordmlgroup/MedAgentBench, offering a valuable framework for model developers to track progress and drive continuous improvements in the agent capabilities of large language models within the medical domain.
中文: 近期大语言模型已从聊天机器人发展为智能体,但缺乏医疗领域的标准化评测基准,为此我们推出MedAgentBench——一个包含临床任务和真实患者数据的综合评估平台,旨在衡量并提升大语言模型在医疗环境中的智能体能力。
English: Large language models have advanced beyond chatbots to act as agents, yet no standardized benchmark exists for their agent capabilities in medical applications; MedAgentBench fills this gap with 300 physician-written, clinically derived tasks, realistic profiles of 100 patients, and a FHIR-compliant interactive environment.

Authors:Jiazhen Zhang, Yuexi Du, Nicha C. Dvornek, John A. Onofrey
Title: Improved Vessel Segmentation with Symmetric Rotation-Equivariant U-Net
Abstract:
Automated segmentation plays a pivotal role in medical image analysis and computer-assisted interventions. Despite the promising performance of existing methods based on convolutional neural networks (CNNs), they neglect useful equivariant properties for images, such as rotational and reflection equivariance. This limitation can decrease performance and lead to inconsistent predictions, especially in applications like vessel segmentation where explicit orientation is absent. While existing equivariant learning approaches attempt to mitigate these issues, they substantially increase learning cost, model size, or both. To overcome these challenges, we propose a novel application of an efficient symmetric rotation-equivariant (SRE) convolutional (SRE-Conv) kernel implementation to the U-Net architecture, to learn rotation and reflection-equivariant features, while also reducing the model size dramatically. We validate the effectiveness of our method through improved segmentation performance on retina vessel fundus imaging. Our proposed SRE U-Net not only significantly surpasses standard U-Net in handling rotated images, but also outperforms existing equivariant learning methods and does so with a reduced number of trainable parameters and smaller memory cost. The code is available at https://github.com/OnofreyLab/sre_conv_segm_isbi2025.
中文摘要:本研究提出的SRE U-Net在U-Net架构中引入了高效的对称旋转等变卷积核,不仅显著提升了医学图像旋转分割性能,还大幅减少了模型参数量和计算成本,优于现有等变学习方法。
English Summary: The proposed SRE U-Net applies an efficient symmetric rotation-equivariant convolutional kernel to the U-Net architecture, outperforming standard U-Net on rotated retinal vessel images and existing equivariant learning methods while using fewer trainable parameters and less memory.

Authors:Panisara Meehinkong, Donlapark Ponnoprat
Title: coverforest: Conformal Predictions with Random Forest in Python
Abstract:
Conformal prediction provides a framework for uncertainty quantification, specifically in the forms of prediction intervals and sets with distribution-free guaranteed coverage. While recent cross-conformal techniques such as CV+ and Jackknife+-after-bootstrap achieve better data efficiency than traditional split conformal methods, they incur substantial computational costs due to required pairwise comparisons between training and test samples' out-of-bag scores. Observing that these methods naturally extend from ensemble models, particularly random forests, we leverage existing optimized random forest implementations to enable efficient cross-conformal predictions. We present coverforest, a Python package that implements efficient conformal prediction methods specifically optimized for random forests. coverforest supports both regression and classification tasks through various conformal prediction methods, including split conformal, CV+, Jackknife+-after-bootstrap, and adaptive prediction sets. Our package leverages parallel computing and Cython optimizations to speed up out-of-bag calculations. Our experiments demonstrate that coverforest's predictions achieve the desired level of coverage. In addition, its training and prediction times can be faster than an existing implementation by 2--9 times. The source code for the coverforest is hosted on GitHub at https://github.com/donlapark/coverforest.
中文: coverforest Python 包针对随机森林优化实现了高效共形预测方法,通过并行计算和Cython优化,在保证覆盖精度的同时将训练预测速度提升2-9倍。
English: The coverforest Python package implements conformal prediction methods optimized for random forests, achieving the desired coverage level while its training and prediction can be 2 to 9 times faster than an existing implementation thanks to parallel computing and Cython optimizations.
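The split conformal method named in the abstract above can be illustrated with a minimal NumPy sketch. This is not the coverforest API: the function name `split_conformal_interval` and its arguments are hypothetical, and the sketch assumes absolute residuals as the nonconformity score and a held-out calibration set.

```python
import math
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Hypothetical helper: split conformal intervals from calibration residuals."""
    cal_pred = np.asarray(cal_pred, dtype=float)
    cal_true = np.asarray(cal_true, dtype=float)
    test_pred = np.asarray(test_pred, dtype=float)

    # Nonconformity scores: absolute residuals on the held-out calibration set.
    scores = np.sort(np.abs(cal_true - cal_pred))
    n = scores.size

    # Rank of the finite-sample-corrected quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))

    # With too few calibration points the interval is unbounded.
    q = np.inf if k > n else scores[k - 1]
    return test_pred - q, test_pred + q
```

The interval half-width is a single calibration quantile, which is why split conformal is cheap but uses the data less efficiently than the cross-conformal variants the abstract mentions.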

Authors:Haifeng Wen, Hong Xing, Osvaldo Simeone
Title: Distributed Conformal Prediction via Message Passing
Abstract:
Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies. The code of our work is released on: https://github.com/HaifengWen/Distributed-Conformal-Prediction.
Chinese: 本研究提出了Q-DCP和H-DCP两种分布式共形预测方法,使数据有限的设备能通过任意图拓扑与邻居通信协作,实现具有统计保证的可靠推理。
English: This study introduces two distributed conformal prediction methods, Q-DCP and H-DCP, which enable devices with limited calibration data to collaboratively achieve reliable inference with statistical guarantees through neighbor communication over arbitrary graph topologies.

Authors:Wenzhang Liu, Lianjun Jin, Lu Ren, Chaoxu Mu, Changyin Sun
Title: Reducing Action Space for Deep Reinforcement Learning via Causal Effect Estimation
Abstract:
Intelligent decision-making within large and redundant action spaces remains challenging in deep reinforcement learning. Considering similar but ineffective actions at each step can lead to repetitive and unproductive trials. Existing methods attempt to improve agent exploration by reducing or penalizing redundant actions, yet they fail to provide quantitative and reliable evidence to determine redundancy. In this paper, we propose a method to improve exploration efficiency by estimating the causal effects of actions. Unlike prior methods, our approach offers quantitative results regarding the causality of actions for one-step transitions. We first pre-train an inverse dynamics model to serve as prior knowledge of the environment. Subsequently, we classify actions across the entire action space at each time step and estimate the causal effect of each action to suppress redundant actions during exploration. We provide a theoretical analysis to demonstrate the effectiveness of our method and present empirical results from simulations in environments with redundant actions to evaluate its performance. Our implementation is available at https://github.com/agi-brain/cee.git.
Chinese: 本文提出了一种通过估计动作的因果效应来提高深度强化学习中探索效率的方法,该方法能定量识别并抑制冗余动作,从而解决大动作空间中的决策挑战。
English: This paper introduces a method to enhance exploration efficiency in deep reinforcement learning by estimating the causal effects of actions, which quantitatively identifies and suppresses redundant actions to address challenges in large action spaces.

Authors:Fanxing Li, Fangyu Sun, Tianbao Zhang, Danping Zou
Title: ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards
Abstract:
Quadrotor control policies can be trained with high performance using the exact gradients of the rewards to directly optimize policy parameters via backpropagation-through-time (BPTT). However, designing a fully differentiable reward architecture is often challenging. Partially differentiable rewards will result in biased gradient propagation that degrades training performance. To overcome this limitation, we propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT. ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function. Additionally, it adopts entropy regularization and state initialization mechanisms to encourage exploration during training. We evaluate ABPT on four representative quadrotor flight tasks in both the real world and simulation. Experimental results demonstrate that ABPT converges significantly faster and achieves higher ultimate rewards than existing learning algorithms, particularly in tasks involving partially differentiable rewards. The code will be released at http://github.com/Fanxing-LI/ABPT.
中文: 提出的修正时间反向传播(ABPT)方法通过结合0步与N步回报及熵正则化,有效缓解了四旋翼控制训练中的梯度偏差,在部分可微奖励场景下比现有算法收敛更快且获得更高回报。
English: The proposed Amended Backpropagation-through-Time (ABPT) method effectively mitigates gradient bias in quadrotor control training by combining 0-step and N-step returns with entropy regularization, achieving faster convergence and higher rewards than existing algorithms, especially with partially differentiable rewards.

Authors:Jia Yu, Fei Yuan, Rui Min, Jing Yu, Pei Chu, Jiayang Li, Wei Li, Ruijie Zhang, Zhenxiang Li, Zhifei Ren, Dong Zheng, Wenjian Zhang, Yan Teng, Lingyu Meng, ZhenJiang Jin, Jiantao Qiu, ShaSha Wang, Zhongying Tu, Dahua Lin, Yu Wang, Yu Qiao, Yanfeng Wang, Conghui He
Title: WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
Abstract:
This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset, while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and the GitHub repository is available at https://github.com/opendatalab/WanJuan3.0.
中文: 本文介绍了开源数据集万卷丝路,旨在通过系统化的数据处理框架为低资源语言提供高质量训练语料,提升多语言模型研发,目前五种语言数据已全面开源并可在指定平台获取。
English: This paper presents the open-source WanJuanSiLu dataset, designed to enhance multilingual model development for low-resource languages through a systematic data processing framework that ensures quality, safety, and linguistic diversity, with all five languages' data now fully available online.

Authors:Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez-Basulto, Jeff Z. Pan
Title: Evaluating and Improving Graph to Text Generation with Large Language Models
Abstract:
Large language models (LLMs) have demonstrated immense potential across various tasks. However, research for exploring and improving the capabilities of LLMs in interpreting graph structures remains limited. To address this gap, we conduct a comprehensive evaluation of prompting current open-source LLMs on graph-to-text generation tasks. Although we explored the optimal prompting strategies and proposed a novel and effective diversity-difficulty-based few-shot sample selection method, we found that the improvements from tuning-free approaches were incremental, as LLMs struggle with planning on complex graphs, particularly those with a larger number of triplets. To further improve LLMs in planning with graph sequences and grounding in truth, we introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks: reordering and attribution. Through extensive automatic and human evaluations, we demonstrate significant improvements in the quality of generated text from both few-shot learning and fine-tuning perspectives using the PlanGTG dataset. Our study paves the way for new research directions in graph-to-text generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
Chinese: 大型语言模型在图到文本生成任务中潜力巨大但面临挑战,为此我们开发了PlanGTG数据集,通过重排序和归因标注显著提升了文本生成质量。
English: Large language models show potential but face challenges in graph-to-text tasks, leading to the creation of the PlanGTG dataset which significantly improves text generation through reordering and attribution annotations.

Authors:Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
Title: RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
Abstract:
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at https://github.com/tangzhy/RealCritic.
Chinese: 本文提出了一种闭环基准来评估大语言模型的批判能力,发现先进推理模型在自我批判和迭代场景中优于传统模型,而传统模型有时甚至表现低于基准水平。
English: This paper introduces a closed-loop benchmark to evaluate LLMs' critique capabilities, revealing that the advanced reasoning model o1-mini outperforms classical LLMs across all critique scenarios, while classical models sometimes fall below their baseline performance in self-critique and iterative critique settings.

Authors:Jake McLaughlin, Nicholas Charron, Sriram Narasimhan
Title: Visual-Lidar Map Alignment for Infrastructure Inspections
Abstract:
Routine and repetitive infrastructure inspections present safety, efficiency, and consistency challenges as they are performed manually, often in challenging or hazardous environments. They can also introduce subjectivity and errors into the process, resulting in undesirable outcomes. Simultaneous localization and mapping (SLAM) presents an opportunity to generate high-quality 3D maps that can be used to extract accurate and objective inspection data. Yet, many SLAM algorithms are limited in their ability to align 3D maps from repeated inspections in GPS-denied settings automatically. This limitation hinders practical long-term asset health assessments by requiring tedious manual alignment for data association across scans from previous inspections. This paper introduces a versatile map alignment algorithm leveraging both visual and lidar data for improved place recognition robustness and presents an infrastructure-focused dataset tailored for consecutive inspections. By detaching map alignment from SLAM, our approach enhances infrastructure inspection pipelines, supports monitoring asset degradation over time, and invigorates SLAM research by permitting exploration beyond existing multi-session SLAM algorithms.
中文: 人工基础设施检测存在安全与精度问题,本文提出一种结合视觉与激光雷达的通用地图对齐算法,能实现自动化检测并支持长期资产状态监测。
English: Manual infrastructure inspections pose safety, efficiency, and consistency challenges; this paper introduces a versatile map alignment algorithm that uses visual and lidar data to automatically align 3D maps from repeated inspections, supporting long-term monitoring of asset degradation.

Authors:Xu Chu, Zhijie Tan, Hanlin Xue, Guanyu Wang, Tong Mo, Weiping Li
Title: Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains
Abstract:
Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users' confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domain$o1$s, which enhances LLMs' reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models' explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s's leading performance and explainability. Our code is available at https://github.com/Hyalinesky/Domaino1s.
中文: 本文提出的Domain$o1$s方法通过监督微调和选择性树搜索增强大语言模型在专业领域的推理能力,结合新建评估指标验证了其在金融投资和法律问答任务中具备领先性能与可解释性。
English: This paper introduces Domain$o1$s, a method that enhances LLMs' reasoning and explainability in high-stakes domains through supervised fine-tuning with specialized datasets and selective tree exploration, validated by a new evaluation metric and experiments showing superior performance.

Authors:Yoni Schirris, Rosie Voorthuis, Mark Opdam, Marte Liefaard, Gabe S Sonke, Gwen Dackus, Vincent de Jong, Yuwei Wang, Annelot Van Rossum, Tessa G Steenbruggen, Lars C Steggink, Liesbeth G. E. de Vries, Marc van de Vijver, Roberto Salgado, Efstratios Gavves, Paul J van Diest, Sabine C Linn, Jonas Teuwen, Renee Menezes, Marleen Kok, Hugo Horlings
Title: ECTIL: Label-efficient Computational Tumour Infiltrating Lymphocyte (TIL) assessment in breast cancer: Multicentre validation in 2,340 patients with breast cancer
Abstract:
The level of tumour-infiltrating lymphocytes (TILs) is a prognostic factor for patients with (triple-negative) breast cancer (BC). Computational TIL assessment (CTA) has the potential to assist pathologists in this labour-intensive task, but current CTA models rely heavily on many detailed annotations. We propose and validate a fundamentally simpler deep learning based CTA that can be trained in only ten minutes on hundredfold fewer pathologist annotations. We collected whole slide images (WSIs) with TILs scores and clinical data of 2,340 patients with BC from six cohorts including three randomised clinical trials. Morphological features were extracted from whole slide images (WSIs) using a pathology foundation model. Our label-efficient Computational stromal TIL assessment model (ECTIL) directly regresses the TILs score from these features. ECTIL trained on only a few hundred samples (ECTIL-TCGA) showed concordance with the pathologist over five heterogeneous external cohorts (r=0.54-0.74, AUROC=0.80-0.94). Training on all slides of five cohorts (ECTIL-combined) improved results on a held-out test set (r=0.69, AUROC=0.85). Multivariable Cox regression analyses indicated that every 10% increase of ECTIL scores was associated with improved overall survival independent of clinicopathological variables (HR 0.86, p<0.01), similar to the pathologist score (HR 0.87, p<0.001). We demonstrate that ECTIL is highly concordant with an expert pathologist and obtains a similar hazard ratio. ECTIL has a fundamentally simpler design than existing methods and can be trained on orders of magnitude fewer annotations. Such a CTA may be used to pre-screen patients for, e.g., immunotherapy clinical trial inclusion, or as a tool to assist clinicians in the diagnostic work-up of patients with BC. Our model is available under an open source licence (https://github.com/nki-ai/ectil).
中文: ECTIL是一种新型深度学习模型,能以极少的标注高效预测乳腺癌中肿瘤浸润淋巴细胞评分,与病理学家评估高度一致,并对患者生存具有重要预后价值。
English: ECTIL, a novel deep learning model, efficiently predicts tumor-infiltrating lymphocyte scores in breast cancer with minimal annotations, demonstrating high concordance with pathologists and significant prognostic value for patient survival.

Authors:Lingwei Zhu, Han Wang, Yukie Nagai
Title: Fat-to-Thin Policy Optimization: Offline RL with Sparse Policies
Abstract:
Sparse continuous policies are distributions that can choose some actions at random yet keep strictly zero probability for the other actions, which are radically different from the Gaussian. They have important real-world implications, e.g. in modeling safety-critical tasks like medicine. The combination of offline reinforcement learning and sparse policies provides a novel paradigm that enables learning completely from logged datasets a safety-aware sparse policy. However, sparse policies can cause difficulty with the existing offline algorithms which require evaluating actions that fall outside of the current support. In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO). Specifically, we maintain a fat (heavy-tailed) proposal policy that effectively learns from the dataset and injects knowledge to a thin (sparse) policy, which is responsible for interacting with the environment. We instantiate FtTPO with the general $q$-Gaussian family that encompasses both heavy-tailed and sparse policies and verify that it performs favorably in a safety-critical treatment simulation and the standard MuJoCo suite. Our code is available at https://github.com/lingweizhu/fat2thin.
中文: 本文提出了Fat-to-Thin策略优化算法(FtTPO),这是首个能够通过从重尾提议策略向稀疏目标策略转移知识,从而有效从记录数据集中学习安全感知稀疏策略的离线算法,在安全关键仿真和标准测试中表现优异。
English: This paper introduces Fat-to-Thin Policy Optimization (FtTPO), the first offline algorithm that effectively learns safety-aware sparse policies from logged datasets by transferring knowledge from a heavy-tailed proposal policy to a sparse target policy, demonstrating strong performance in safety-critical simulations and standard benchmarks.

Authors:Xinyu Ma, Yifeng Xu, Yang Lin, Tianlong Wang, Xu Chu, Xin Gao, Junfeng Zhao, Yasha Wang
Title: DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing
Abstract:
We introduce DRESS, a novel approach for generating stylized large language model (LLM) responses through representation editing. Existing methods like prompting and fine-tuning are either insufficient for complex style adaptation or computationally expensive, particularly in tasks like NPC creation or character role-playing. Our approach leverages the over-parameterized nature of LLMs to disentangle a style-relevant subspace within the model's representation space to conduct representation editing, ensuring a minimal impact on the original semantics. By applying adaptive editing strengths, we dynamically adjust the steering vectors in the style subspace to maintain both stylistic fidelity and semantic integrity. We develop two stylized QA benchmark datasets to validate the effectiveness of DRESS, and the results demonstrate significant improvements compared to baseline methods such as prompting and ITI. In short, DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it particularly useful for developing stylized conversational agents. Codes and benchmark datasets are available at https://github.com/ArthurLeoM/DRESS-LLM.
Chinese: DRESS是一种轻量级、无需训练的新方法,通过在大语言模型的表示空间中编辑风格子空间,在保持语义完整性的同时实现灵活有效的风格控制,经新基准数据集验证优于现有基线方法。
English: DRESS is a lightweight, train-free method that enhances large language models by editing representations in a style subspace, achieving superior style control without compromising semantics, as validated on new benchmark datasets.

Authors:Weicai Yan, Ye Wang, Wang Lin, Zirun Guo, Zhou Zhao, Tao Jin
Title: Low-rank Prompt Interaction for Continual Vision-Language Retrieval
Abstract:
Research on continual learning in multi-modal tasks has been receiving increasing attention. However, most existing work overlooks the explicit cross-modal and cross-task interactions. In this paper, we innovatively propose the Low-rank Prompt Interaction (LPI) to address this general problem of multi-modal understanding, which considers both cross-modal and cross-task interactions. Specifically, as for the former, we employ multi-modal correlation modules for corresponding Transformer layers. Considering that the training parameters scale to the number of layers and tasks, we propose low-rank interaction-augmented decomposition to avoid memory explosion while enhancing the cross-modal association through sharing and separating common-specific low-rank factors. In addition, due to the multi-modal semantic differences carried by the low-rank initialization, we adopt hierarchical low-rank contrastive learning to ensure training robustness. As for the latter, we initially employ a visual analysis and identify that different tasks have clear distinctions in proximity. Therefore, we introduce explicit task contrastive constraints in the prompt learning process based on task semantic distances. Experiments on two retrieval tasks show performance improvements with the introduction of a minimal number of parameters, demonstrating the effectiveness of our method. Code is available at https://github.com/Kelvin-ywc/LPI.
Chinese: 本文提出低秩提示交互(LPI)方法,通过低秩分解和对比学习解决多模态持续学习中的跨模态与跨任务交互问题,以少量参数显著提升了检索任务的性能表现。
English: This paper introduces Low-rank Prompt Interaction (LPI), a novel method that enhances multi-modal continual learning by addressing cross-modal and cross-task interactions through low-rank decomposition and contrastive learning, achieving improved performance with minimal parameters.

Authors:Kai-Tuo Xu, Feng-Long Xie, Xu Tang, Yao Hu
Title: FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration
Abstract:
We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements in superior performance and optimal efficiency across various applications. FireRedASR comprises two variants: FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities. On public Mandarin benchmarks, FireRedASR-LLM (8.3B parameters) achieves an average Character Error Rate (CER) of 3.05%, surpassing the latest SOTA of 3.33% with an 8.4% relative CER reduction (CERR). It demonstrates superior generalization capability over industrial-grade baselines, achieving 24%-40% CERR in multi-source Mandarin ASR scenarios such as video, live, and intelligent assistant. FireRedASR-AED: Designed to balance high performance and computational efficiency and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture. On public Mandarin benchmarks, FireRedASR-AED (1.1B parameters) achieves an average CER of 3.18%, slightly worse than FireRedASR-LLM but still outperforming the latest SOTA model with over 12B parameters. It offers a more compact size, making it suitable for resource-constrained applications. Moreover, both models exhibit competitive results on Chinese dialects and English speech benchmarks and excel in singing lyrics recognition. To advance research in speech processing, we release our models and inference code at https://github.com/FireRedTeam/FireRedASR.
中文摘要:FireRedASR推出了两款中文自动语音识别模型:FireRedASR-LLM采用编码器-适配器-大语言模型框架实现最优性能,FireRedASR-AED基于注意力编码器-解码器架构平衡效率与精度,两者在多项基准测试中均超越现有最优模型,并支持多方言及歌词识别应用。
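English Summary: FireRedASR introduces two Mandarin ASR models: FireRedASR-LLM (8.3B parameters), which uses an Encoder-Adapter-LLM framework and achieves an average CER of 3.05% on public Mandarin benchmarks, and FireRedASR-AED (1.1B parameters), which uses an Attention-based Encoder-Decoder architecture and achieves 3.18%. Both outperform the previous state of the art (3.33%) and generalize well across multi-source scenarios, Chinese dialects, English, and singing lyrics recognition.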
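The relative CER reduction (CERR) quoted in the abstract above can be checked with a short calculation. The sketch below assumes the usual definition, (baseline CER - new CER) / baseline CER; the function name is hypothetical.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction (CERR), as a percentage of the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer

# Figures from the abstract: previous SOTA 3.33% CER, FireRedASR-LLM 3.05% CER.
cerr = relative_cer_reduction(3.33, 3.05)
print(f"CERR: {cerr:.1f}%")  # prints "CERR: 8.4%"
```

Under that definition, moving from 3.33% to 3.05% is a 0.28-point absolute drop, which is about 8.4% of the baseline and matches the reported figure.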

Authors:Taha Emre, Teresa Araújo, Marzieh Oghbaie, Dmitrii Lachinov, Guilherme Aresta, Hrvoje Bogunović
Title: Automatic detection and prediction of nAMD activity change in retinal OCT using Siamese networks and Wasserstein Distance for ordinality
Abstract:
Neovascular age-related macular degeneration (nAMD) is a leading cause of vision loss among older adults, where disease activity detection and progression prediction are critical for nAMD management in terms of timely drug administration and improving patient outcomes. Recent advancements in deep learning offer a promising solution for predicting changes in AMD from optical coherence tomography (OCT) retinal volumes. In this work, we propose deep learning models for the two tasks of the public MARIO Challenge at MICCAI 2024, designed to detect and forecast changes in nAMD severity with longitudinal retinal OCT. For the first task, we employ a Vision Transformer (ViT) based Siamese Network to detect changes in AMD severity by comparing scan embeddings of a patient from different time points. To train a model to forecast the change after 3 months, we exploit, for the first time, an Earth Mover (Wasserstein) Distance-based loss to harness the ordinal relation within the severity change classes. Both models ranked high on the preliminary leaderboard, demonstrating that their predictive capabilities could facilitate nAMD treatment management.
中文: 本研究提出基于视觉变换器的孪生网络模型,结合Wasserstein距离损失函数,能够通过纵向OCT扫描准确检测和预测新生血管性年龄相关性黄斑变性的严重程度变化,为临床治疗管理提供有效支持。
English: This study introduces deep learning models using Vision Transformers and Wasserstein Distance loss to effectively detect and predict neovascular AMD severity changes from longitudinal OCT scans, showing strong potential for improving treatment management.
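
Note on the ordinal loss: a minimal sketch of why an Earth Mover (Wasserstein) distance respects class order, assuming unit spacing between adjacent classes. The four-class setup and the function name `emd_ordinal` are hypothetical, and the exact loss used in the paper may differ (for example, a squared variant).

```python
import numpy as np


def emd_ordinal(p, q):
    """1-D Earth Mover's Distance between two distributions over ordered classes.

    With unit spacing between classes, this equals the sum of absolute
    differences between the two cumulative distributions.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())


# Hypothetical example with four ordered classes; the true class is index 2.
target = [0, 0, 1, 0]
near_miss = [0, 1, 0, 0]  # one class away from the target
far_miss = [1, 0, 0, 0]   # two classes away from the target

print(emd_ordinal(near_miss, target), emd_ordinal(far_miss, target))
```

Unlike cross entropy, which scores both wrong predictions identically, this distance grows with how far the predicted mass sits from the true class.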

Authors:Xiaohao Xu, Tianyi Zhang, Shibo Zhao, Xiang Li, Sibo Wang, Yongqi Chen, Ye Li, Bhiksha Raj, Matthew Johnson-Roberson, Sebastian Scherer, Xiaonan Huang
Title: Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video
Abstract:
We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models. While such sanitized conditions simplify evaluation, they fail to capture the unpredictable, noisy complexities of real-world environments. Dynamic motion, sensor imperfections, and synchronization perturbations lead to sharp performance declines when these models are deployed in practice, revealing an urgent need for frameworks that embrace and excel under real-world noise. To bridge this gap, we tackle three core challenges: scalable data generation, comprehensive benchmarking, and model robustness enhancement. First, we introduce a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors. Second, we leverage this pipeline to create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation, highlighting the limitations of current learning-based methods in ego-motion accuracy and 3D reconstruction quality. Third, we propose Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with rendered RGB-D frames from a clean 3D map, enhancing geometric alignment and appearance restoration through visual correspondence. Extensive experiments on synthetic and real-world data demonstrate that CorrGS consistently outperforms prior state-of-the-art methods, particularly in scenarios involving rapid motion and dynamic illumination.
中文摘要:本研究针对现有运动估计和三维重建模型难以应对现实世界噪声的局限,提出了包含合成数据生成、新基准测试和测试时自适应方法的完整框架,显著超越了现有技术性能。
English Summary: This research addresses the limitations of existing ego-motion estimation and 3D reconstruction models that struggle with real-world noise by introducing a comprehensive framework including synthetic data generation, a new benchmark, and a test-time adaptation method that significantly outperforms current approaches.

Authors:Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
Title: Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
Abstract:
Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain. Our benchmark and code are available at https://github.com/DSL-Lab/aops
中文: 本文提出一种利用解题艺术论坛资源的自动化流程,构建了包含60多万高质量数学问答对的AoPS-Instruct数据集,实验表明该数据集能提升大语言模型的推理能力,同时通过带时间戳的动态基准揭示模型性能随时间下降的现象。
English: This paper introduces an automated pipeline using the Art of Problem Solving forum to create AoPS-Instruct, a large-scale dataset of over 600,000 high-quality math QA pairs, which enhances LLMs' reasoning abilities and provides a contamination-resistant benchmark revealing their performance decline over time.

Authors:Mitch Kosieradzki, Seongjin Choi
Title: TrajFlow: A Generative Framework for Occupancy Density Estimation Using Normalizing Flows
Abstract:
For intelligent transportation systems and autonomous vehicles to operate safely and efficiently, they must reliably predict the future motion and trajectory of surrounding agents within complex traffic environments. At the same time, the motion of these agents is inherently uncertain, making accurate prediction difficult. In this paper, we propose \textbf{TrajFlow}, a generative framework for estimating the occupancy density of dynamic agents. Our framework utilizes a causal encoder to extract semantically meaningful embeddings of the observed trajectory, as well as a normalizing flow to decode these embeddings and determine the most likely future location of an agent at some time point in the future. Our formulation differs from existing approaches because we model the marginal distribution of spatial locations instead of the joint distribution of unobserved trajectories. The advantages of a marginal formulation are numerous. First, we demonstrate that the marginal formulation produces higher accuracy on challenging trajectory forecasting benchmarks. Second, the marginal formulation allows for fully continuous sampling of future locations. Finally, marginal densities are better suited for downstream tasks as they allow for the computation of per-agent motion trajectories and occupancy grids, the two most commonly used representations for motion forecasting. We present a novel architecture based entirely on neural differential equations as an implementation of this framework and provide ablations to demonstrate the advantages of a continuous implementation over a more traditional discrete neural network based approach. The code is available at https://github.com/UMN-Choi-Lab/TrajFlow.
中文: 提出的TrajFlow框架通过因果编码器和归一化流建模动态智能体空间位置的边际分布,在复杂交通环境中实现了更高精度的运动预测和连续采样能力。
English: The proposed TrajFlow framework uses a causal encoder and normalizing flow to model the marginal distribution of dynamic agents' spatial locations, achieving higher accuracy and continuous sampling for motion prediction in complex traffic environments.

Authors:Yiyun Zhou, Wenkang Han, Jingyuan Chen
Title: DKT2: Revisiting Applicable and Comprehensive Knowledge Tracing in Large-Scale Data
Abstract:
Knowledge Tracing (KT) is a fundamental component of Intelligent Tutoring Systems (ITS), enabling the modeling of students' knowledge states to predict future performance. The introduction of Deep Knowledge Tracing (DKT), the first deep learning-based KT (DLKT) model, has brought significant advantages in terms of applicability and comprehensiveness. However, recent DLKT models, such as Attentive Knowledge Tracing (AKT), have often prioritized predictive performance at the expense of these benefits. While deep sequential models like DKT have shown potential, they face challenges related to parallel computing, storage decision modification, and limited storage capacity. To address these limitations, we propose DKT2, a novel KT model that leverages the recently developed xLSTM architecture. DKT2 enhances applicable input representation using the Rasch model and incorporates Item Response Theory (IRT) for output interpretability, allowing for the decomposition of learned knowledge into familiar and unfamiliar knowledge. By integrating this knowledge with predicted questions, DKT2 generates comprehensive knowledge states. Extensive experiments conducted across three large-scale datasets demonstrate that DKT2 consistently outperforms 18 baseline models in various prediction tasks, underscoring its potential for real-world educational applications. This work bridges the gap between theoretical advancements and practical implementation in KT. Our code and datasets are fully available at https://github.com/zyy-2001/DKT2.
中文: DKT2是一种新颖的知识追踪模型,它采用xLSTM架构,结合拉什模型和项目反应理论来增强输入表示和输出可解释性,并在多种预测任务中持续优于18个基线模型,显示出在实际教育应用中的巨大潜力。
English: DKT2 is a novel knowledge tracing model that utilizes the xLSTM architecture, integrates the Rasch model and Item Response Theory for enhanced input representation and output interpretability, and consistently outperforms 18 baseline models across various prediction tasks, demonstrating strong potential for real-world educational applications.
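
Note on the Rasch model: as background for the IRT terminology in the abstract, the classic Rasch (one-parameter logistic) model predicts a correct response from the gap between student ability and item difficulty. This is only the textbook formula, not DKT2's architecture; how DKT2 uses Rasch-style embeddings for its input representation is not specified here, and the function name is illustrative.

```python
import math


def rasch_prob(ability: float, difficulty: float) -> float:
    """Textbook Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values: a student of average ability on an easy and a hard item.
p_easy = rasch_prob(ability=0.0, difficulty=-1.0)
p_hard = rasch_prob(ability=0.0, difficulty=1.0)
print(f"easy item: {p_easy:.2f}, hard item: {p_hard:.2f}")
```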

Authors:Yi Zhao, Youzhi Zhang
Title: Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors
Abstract:
Large language models (LLMs) are widely used in real-world applications, raising concerns about their safety and trustworthiness. While red-teaming with jailbreak prompts exposes the vulnerabilities of LLMs, current efforts focus primarily on single-turn attacks, overlooking the multi-turn strategies used by real-world adversaries. Existing multi-turn methods rely on static patterns or predefined logical chains, failing to account for the dynamic strategies during attacks. We propose Siren, a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors. Siren consists of three stages: (1) training set construction utilizing Turn-Level LLM feedback (Turn-MF), (2) post-training attackers with supervised fine-tuning (SFT) and direct preference optimization (DPO), and (3) interactions between the attacking and target LLMs. Experiments demonstrate that Siren achieves an attack success rate (ASR) of 90% with LLaMA-3-8B as the attacker against Gemini-1.5-Pro as the target model, and 70% with Mistral-7B against GPT-4o, significantly outperforming single-turn baselines. Moreover, Siren with a 7B-scale model achieves performance comparable to a multi-turn baseline that leverages GPT-4o as the attacker, while requiring fewer turns and employing decomposition strategies that are better semantically aligned with attack goals. We hope Siren inspires the development of stronger defenses against advanced multi-turn jailbreak attacks under realistic scenarios. Code is available at https://github.com/YiyiyiZhao/siren. Warning: This paper contains potentially harmful text.
中文:Siren框架通过基于学习的多轮攻击策略模拟真实的人类越狱行为,在对抗Gemini-1.5-Pro和GPT-4o等先进模型时实现了高达90%和70%的攻击成功率,显著优于现有方法。
English: The proposed Siren framework simulates realistic multi-turn jailbreak attacks on large language models by employing learning-based strategies, achieving high success rates against advanced models like Gemini-1.5-Pro and GPT-4o while outperforming existing methods.

Authors:Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen
Title: GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
Abstract:
Deep neural networks are highly vulnerable to adversarial examples, inputs with small, carefully crafted perturbations that cause misclassification, making adversarial attacks an essential tool for robustness evaluation. Existing black-box attacks fall into three categories: query-only, transfer-only, and query-and-transfer, and vary in perturbation pattern and optimization strategy. However, no prior method jointly achieves query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization, leaving a gap between black-box flexibility and white-box precision. We present GreedyPixel, a new attack framework that fills this gap by combining a surrogate-derived pixel priority map with greedy, per-pixel optimization refined by query feedback. This design reduces the exponential brute-force search space to a tractable linear procedure, guarantees monotonic loss decrease and convergence to a coordinate-wise optimum, and concentrates perturbations on robust, semantically meaningful pixels to improve perceptual quality. Extensive experiments on CIFAR-10 and ImageNet under both white-box and black-box settings demonstrate that GreedyPixel achieves state-of-the-art attack success rates and produces visually imperceptible perturbations. Our results show that GreedyPixel bridges the precision gap between white-box and black-box attacks and provides a practical framework for fine-grained robustness evaluation. The implementation is available at https://github.com/azrealwang/greedypixel.
中文摘要:GreedyPixel是一种新颖的逐像素贪婪算法,仅通过查询反馈即可高效生成高质量对抗样本,在无需梯度信息的情况下达到白盒攻击成功率,同时实现难以察觉的扰动和卓越的计算效率。
English Summary: GreedyPixel is a black-box attack framework that combines a surrogate-derived pixel priority map with greedy, per-pixel optimization refined by query feedback, achieving state-of-the-art attack success rates on CIFAR-10 and ImageNet while producing visually imperceptible perturbations.
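
Note on the greedy procedure: a toy sketch of priority-ordered, per-pixel greedy optimization as the abstract describes it, not the authors' implementation. Everything here is an assumption made for illustration: the function `greedy_pixel_attack`, the linear stand-in for the queried loss, and the use of `|w|` in place of a surrogate-derived priority map. A change is kept only if the queried loss drops, so the loss is non-increasing by construction.

```python
import numpy as np


def greedy_pixel_attack(x, loss_fn, priority, eps, max_pixels):
    """Visit pixels in priority order; keep a +/-eps change only if the queried loss drops."""
    adv = x.copy()
    best = loss_fn(adv)
    history = [best]
    # Highest-priority pixels first (flat indices into the image).
    for idx in np.argsort(-priority.ravel())[:max_pixels]:
        for delta in (eps, -eps):
            cand = adv.copy()
            cand.flat[idx] = np.clip(x.flat[idx] + delta, 0.0, 1.0)
            val = loss_fn(cand)  # one query to the black-box model
            if val < best:
                adv, best = cand, val
        history.append(best)
    return adv, history


# Toy stand-ins: a linear "loss" to minimise and |w| as the priority map.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
image = np.full((4, 4), 0.5)
adv, history = greedy_pixel_attack(
    image, lambda img: float((w * img).sum()), np.abs(w), eps=0.1, max_pixels=5
)
print(history[0], history[-1])
```

Each visited pixel costs a fixed number of queries, so the total query count grows linearly with `max_pixels` rather than exponentially with the number of pixels.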

Authors:Runyi Hu, Jie Zhang, Yiming Li, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
Title: VideoShield: Regulating Diffusion-based Video Generation Models via Watermarking
Abstract:
Artificial Intelligence Generated Content (AIGC) has advanced significantly, particularly with the development of video generation models such as text-to-video (T2V) models and image-to-video (I2V) models. However, like other AIGC types, video generation requires robust content control. A common approach is to embed watermarks, but most research has focused on images, with limited attention given to videos. Traditional methods, which embed watermarks frame-by-frame in a post-processing manner, often degrade video quality. In this paper, we propose VideoShield, a novel watermarking framework specifically designed for popular diffusion-based video generation models. Unlike post-processing methods, VideoShield embeds watermarks directly during video generation, eliminating the need for additional training. To ensure video integrity, we introduce a tamper localization feature that can detect changes both temporally (across frames) and spatially (within individual frames). Our method maps watermark bits to template bits, which are then used to generate watermarked noise during the denoising process. Using DDIM Inversion, we can reverse the video to its original watermarked noise, enabling straightforward watermark extraction. Additionally, template bits allow precise detection of potential temporal and spatial modifications. Extensive experiments across various video models (both T2V and I2V models) demonstrate that our method effectively extracts watermarks and detects tampering without compromising video quality. Furthermore, we show that this approach is applicable to image generation models, enabling tamper detection in generated images as well. Code and models are available at https://github.com/hurunyi/VideoShield.
中文: VideoShield是一种创新的水印框架,直接在基于扩散的视频生成过程中嵌入水印,既能保持视频质量又能实现精准篡改检测,且无需额外训练。
English: VideoShield is a novel watermarking framework that embeds watermarks directly during diffusion-based video generation, enabling robust tamper detection without compromising quality or requiring additional training.

Authors:Mojtaba Safari, Zach Eidex, Chih-Wei Chang, Richard L. J. Qiu, Xiaofeng Yang
Title: Advancing MRI Reconstruction: A Systematic Review of Deep Learning and Compressed Sensing Integration
Abstract:
Magnetic resonance imaging (MRI) is a non-invasive imaging modality that provides comprehensive anatomical and functional insights into the human body. However, its long acquisition times can lead to patient discomfort and motion artifacts and can limit real-time applications. To address these challenges, strategies such as parallel imaging have been applied, which utilize multiple receiver coils to speed up the data acquisition process. Additionally, compressed sensing (CS) is a method that facilitates image reconstruction from sparse data, significantly reducing image acquisition time by minimizing the amount of data collection needed. Recently, deep learning (DL) has emerged as a powerful tool for improving MRI reconstruction. It has been integrated with parallel imaging and CS principles to achieve faster and more accurate MRI reconstructions. This review comprehensively examines DL-based techniques for MRI reconstruction. We categorize and discuss various DL-based methods, including end-to-end approaches, unrolled optimization, and federated learning, highlighting their potential benefits. Our systematic review highlights significant contributions and underscores the potential of DL in MRI reconstruction. Additionally, we summarize key results and trends in DL-based MRI reconstruction, including quantitative metrics, the dataset, acceleration factors, and the progress of and research interest in DL techniques over time. Finally, we discuss potential future directions and the importance of DL-based MRI reconstruction in advancing medical imaging. To facilitate further research in this area, we provide a GitHub repository that includes up-to-date DL-based MRI reconstruction publications and public datasets: https://github.com/mosaf/Awesome-DL-based-CS-MRI.
中文: 深度学习技术通过与并行成像和压缩感知相结合,正在革新磁共振成像重建过程,显著缩短采集时间并提升图像精度;本综述系统分类了各类方法并展望未来方向,同时提供了资源库支持后续研究。
English: Deep learning techniques are revolutionizing MRI reconstruction by integrating with parallel imaging and compressed sensing to significantly reduce acquisition times and improve image accuracy, with this review systematically categorizing methods and highlighting future directions while providing a resource repository.

Authors:Joshua Davis, Thomas Sounack, Kate Sciacca, Jessie M Brain, Brigitte N Durieux, Nicole D Agaronnik, Charlotta Lindvall
Title: MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning
Abstract:
Extracting sections from clinical notes is crucial for downstream analysis but is challenging due to variability in formatting and the labor-intensive nature of manual sectioning. While proprietary large language models (LLMs) have shown promise, privacy concerns limit their accessibility. This study develops a pipeline for automated note sectioning using open-source LLMs, focusing on three sections: History of Present Illness, Interval History, and Assessment and Plan. We fine-tuned three open-source LLMs to extract sections using a curated dataset of 487 progress notes, comparing results with those of proprietary models (GPT-4o, GPT-4o mini). Internal and external validity were assessed via precision, recall, and F1 score. Fine-tuned Llama 3.1 8B outperformed GPT-4o (F1=0.92). On the external validity test set, performance remained high (F1=0.85). Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning, offering advantages in cost, performance, and accessibility.
中文: 本研究开发了一种基于微调开源大语言模型的临床笔记自动分段流程,在超越专有模型性能的同时,有效解决了隐私保护和可访问性问题。
English: This study develops a pipeline using fine-tuned open-source LLMs to automate clinical note sectioning, demonstrating superior performance over proprietary models while addressing privacy and accessibility concerns.
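
Note on the metrics: the abstract reports precision, recall, and F1 but does not say at what granularity (tokens, characters, or whole sections) matches are counted. The sketch below shows only the standard definitions from true-positive, false-positive, and false-negative counts. The counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall and F1 from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one extracted section.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```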

Authors:Po-Ting Lai, Chih-Hsuan Wei, Shubo Tian, Robert Leaman, Zhiyong Lu
Title: Enhancing Biomedical Relation Extraction with Directionality
Abstract:
Biological relation networks contain rich information for understanding the biological mechanisms behind the relationship of entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges for updating the network knowledge. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models such as the state-of-the-art GPT-4 and Llama-3 on two benchmarking tasks. Our source code and dataset are available at https://github.com/ncbi-nlp/BioREDirect.
Chinese: 本研究通过为BioRED语料库添加实体角色方向性注释,并提出了一个多任务语言模型,在识别关系和发现新知识方面超越了GPT-4和Llama-3等先进模型。
English: This study enhances the BioRED corpus by adding directional annotations for entity roles and introduces a multi-task language model that surpasses advanced models like GPT-4 and Llama-3 in identifying relationships and novel findings.

Authors:Sneh Pandya, Purvik Patel, Brian D. Nord, Mike Walmsley, Aleksandra Ćiprijanović
Title: SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks
Abstract:
Modern neural networks (NNs) often do not generalize well in the presence of a "covariate shift"; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels remains unchanged. In such cases, NN generalization can be reduced to a problem of learning more domain-invariant features. Domain adaptation (DA) methods include a range of techniques aimed at achieving this; however, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observations. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs). We find that SIDDA enhances the generalization capabilities of NNs, achieving up to a $\approx40\%$ improvement in classification accuracy on unlabeled target data. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group $D_N$, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA enhances model calibration on both source and target data--achieving over an order of magnitude improvement in the ECE and Brier score. SIDDA's versatility, combined with its automated approach to domain alignment, has the potential to advance multi-dataset studies by enabling the development of highly generalizable models.
中文: SIDDA是一种基于Sinkhorn散度的高效域自适应算法,能以最少的超参数调优实现域对齐,显著提升神经网络在不同数据集上的泛化能力和模型校准效果。
English: SIDDA is an efficient domain adaptation algorithm that uses Sinkhorn divergence to achieve domain alignment with minimal hyperparameter tuning, significantly improving neural network generalization and model calibration across various datasets.
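
Note on the Sinkhorn divergence: a minimal NumPy sketch of a debiased Sinkhorn divergence between two point clouds, to illustrate the kind of alignment term a method like SIDDA builds on. It is not the SIDDA implementation. The assumptions are: uniform weights, squared Euclidean cost, a fixed regularization `eps`, plain (not log-domain) Sinkhorn updates, and the "sharp" variant that uses the transport cost of the entropic plan. All names are illustrative.

```python
import numpy as np


def entropic_ot_cost(x, y, eps=1.0, n_iter=500):
    """Transport cost <P, C> of the entropic plan between two uniform point clouds."""
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    # Squared Euclidean cost between every pair of points.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):  # Sinkhorn fixed-point updates
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())


def sinkhorn_divergence(x, y, eps=1.0):
    """Debiased divergence: OT(x, y) - 0.5 * OT(x, x) - 0.5 * OT(y, y)."""
    return (entropic_ot_cost(x, y, eps)
            - 0.5 * entropic_ot_cost(x, x, eps)
            - 0.5 * entropic_ot_cost(y, y, eps))


rng = np.random.default_rng(0)
source = rng.normal(size=(50, 2))  # stand-in for source-domain features
target = source + 1.0              # stand-in for shifted target-domain features
same = sinkhorn_divergence(source, source)
shifted = sinkhorn_divergence(source, target)
print(same, shifted)
```

The two self-transport terms cancel the bias that entropic regularization introduces, so identical clouds score zero while shifted clouds score a positive value that a training loop could minimize.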

Authors:Andrey Palaev, Adil Khan, Syed M. Ahsan Kazmi
Title: LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps
Abstract:
The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object positions. Our method enables precise manipulations at the instance level without fine-tuning or auxiliary information such as masks or bounding boxes. Code is available at https://github.com/Palandr123/DiffusionU-NetLLM
Chinese Summary: 本文提出了一种新颖方法,利用大型语言模型、开放词汇检测器和扩散U-Net组件,无需微调或额外输入掩码即可实现基于文本提示的实例级图像精确操控。
English Summary: This paper introduces a novel pipeline that utilizes Large Language Models, open-vocabulary detectors, and diffusion U-Net components to achieve precise instance-level image manipulation from text prompts without requiring fine-tuning or additional input masks.

Authors:Luqi Zhang, Haiping Wang, Chong Liu, Zhen Dong, Bisheng Yang
Title: ME-CPT: Multi-Task Enhanced Cross-Temporal Point Transformer for Urban 3D Change Detection
Abstract:
The point clouds collected by the Airborne Laser Scanning (ALS) system provide accurate 3D information of urban land covers. By utilizing multi-temporal ALS point clouds, semantic changes in urban areas can be captured, demonstrating significant potential in urban planning, emergency management, and infrastructure maintenance. Existing 3D change detection methods struggle to efficiently extract multi-class semantic information and change features, still facing the following challenges: (1) the difficulty of accurately modeling the spatial relationships of cross-temporal point clouds for effective change feature extraction; (2) class imbalance of change samples, which hinders the distinguishability of semantic features; (3) the lack of real-world datasets for 3D semantic change detection. To resolve these challenges, we propose the Multi-task Enhanced Cross-temporal Point Transformer (ME-CPT) network. ME-CPT establishes spatiotemporal correspondences between point clouds across different epochs and employs attention mechanisms to jointly extract semantic change features, facilitating information exchange and change comparison. Additionally, we incorporate a semantic segmentation task and, through the multi-task training strategy, further enhance the distinguishability of semantic features, reducing the impact of class imbalance in change types. Moreover, we release a 22.5 $km^2$ 3D semantic change detection dataset, offering diverse scenes for comprehensive evaluation. Experiments on multiple datasets show that the proposed ME-CPT achieves superior performance compared to existing state-of-the-art methods. The source code and dataset will be released upon acceptance at https://github.com/zhangluqi0209/ME-CPT.
中文: ME-CPT网络通过建立跨时空点云的时空对应关系并利用注意力机制,有效从多时机ALS点云中提取城市语义变化特征,同时通过多任务训练缓解类别不平衡问题,并发布了新的评估数据集。
English: The ME-CPT network effectively captures semantic changes in urban areas from multi-temporal ALS point clouds by establishing spatiotemporal correspondences and using attention mechanisms, while also addressing class imbalance through multi-task training and providing a new dataset for evaluation.

Authors:Ioannis Nasios
Title: Enhancing kelp forest detection in remote sensing images using crowdsourced labels with Mixed Vision Transformers and ConvNeXt segmentation models
Abstract:
Kelp forests, as foundation species, are vital to marine ecosystems, providing essential food and habitat for numerous organisms. This study explores the integration of crowdsourced labels with advanced artificial intelligence models to develop a fast and accurate kelp canopy detection pipeline using Landsat images. Building on the success of a machine learning competition, where this approach ranked third and performed consistently well on both local validation and public and private leaderboards, the research highlights the effectiveness of combining Mixed Vision Transformers (MIT) with ConvNeXt models. Training these models on various image sizes significantly enhanced the accuracy of the ensemble results. U-Net emerged as the best segmentation architecture, with UpperNet also contributing to the final ensemble. Key Landsat bands, such as ShortWave InfraRed (SWIR1) and Near-InfraRed (NIR), were crucial while altitude data was used in postprocessing to eliminate false positives on land. The methodology achieved a high detection rate, accurately identifying about three out of four pixels containing kelp canopy while keeping false positives low. Despite the medium resolution of Landsat satellites, their extensive historical coverage makes them effective for studying kelp forests. This work also underscores the potential of combining machine learning models with crowdsourced data for effective and scalable environmental monitoring. All running code for training all models and inference can be found at https://github.com/IoannisNasios/Kelp_Forests.
中文摘要:本研究通过整合众包标签与混合视觉转换器和ConvNeXt等人工智能模型,利用Landsat图像开发出高效的海藻冠层检测方法,在中等卫星分辨率下仍能实现精确识别。
English Summary: This study demonstrates a highly effective method for detecting kelp canopies using Landsat imagery by integrating crowdsourced labels with AI models like Mixed Vision Transformers and ConvNeXt, achieving accurate results despite medium satellite resolution.

Authors:Yi Yang, Zhang Zhang, Liang Wang
Title: MCRL4OR: Multimodal Contrastive Representation Learning for Off-Road Environmental Perception
Abstract:
Most studies on environmental perception for autonomous vehicles (AVs) focus on urban traffic environments, where the objects/stuff to be perceived are mainly from man-made scenes and scalable datasets with dense annotations can be used to train supervised learning models. By contrast, it is hard to densely annotate a large-scale off-road driving dataset manually due to the inherently unstructured nature of off-road environments. In this paper, we propose a Multimodal Contrastive Representation Learning approach for Off-Road environmental perception, namely MCRL4OR. This approach aims to jointly learn three encoders for processing visual images, locomotion states, and control actions by aligning the locomotion states with the fused features of visual images and control actions within a contrastive learning framework. The rationale behind this alignment strategy is that the inertial locomotion state is the result of taking a certain control action under the current landform/terrain condition perceived by visual sensors. In experiments, we pre-train the MCRL4OR with a large-scale off-road driving dataset and adopt the learned multimodal representations for various downstream perception tasks in off-road driving scenarios. The superior performance in downstream tasks demonstrates the advantages of the pre-trained multimodal representations. The code can be found at https://github.com/1uciusy/MCRL4OR.
中文: 本文提出MCRL4OR多模态对比学习方法,通过将运动状态与视觉和控制特征融合对齐来提升越野环境感知能力,在下游任务中展现出优越性能。
English: This paper introduces MCRL4OR, a multimodal contrastive learning method that aligns locomotion states with fused visual and control features to enhance off-road environmental perception, demonstrating superior performance in downstream tasks.
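
Note on the alignment objective: the abstract says locomotion-state embeddings are aligned with fused visual and control-action features in a contrastive framework, but it does not give the exact loss. The sketch below assumes a common InfoNCE-style objective in which matching rows of the two embedding batches are positives and all other rows are negatives. The random embeddings, the temperature value, and the function name `info_nce` are illustrative assumptions.

```python
import numpy as np


def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss: row i of `anchor` should match row i of `positive`."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # cosine similarities, scaled
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
locomotion = rng.normal(size=(8, 16))  # stand-in for locomotion-state embeddings
fused = locomotion + 0.01 * rng.normal(size=(8, 16))  # stand-in for fused visual+action features
aligned_loss = info_nce(locomotion, fused)
misaligned_loss = info_nce(locomotion, fused[::-1])  # pairs deliberately mismatched
print(aligned_loss, misaligned_loss)
```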

Authors:Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, Sifan Zhou
Title: OstQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
Abstract:
Post-training quantization (PTQ) has emerged as a widely adopted technique for compressing and accelerating Large Language Models (LLMs). The major challenge in LLM quantization is that uneven and heavy-tailed data distributions can expand the quantization range, thereby reducing bit precision for most values. Recent methods attempt to eliminate outliers and balance inter-channel differences by employing linear transformations; however, they remain heuristic and often overlook optimizing the data distribution across the entire quantization space. In this paper, we introduce Quantization Space Utilization Rate (QSUR), a novel metric that effectively assesses the quantizability of transformed data by measuring the space utilization of the data in the quantization space. We complement QSUR with mathematical derivations that examine the effects and limitations of various transformations, guiding our development of Orthogonal and Scaling Transformation-based Quantization (OSTQuant). OSTQuant employs a learnable equivalent transformation, consisting of an orthogonal transformation and a scaling transformation, to optimize the distributions of weights and activations across the entire quantization space. Furthermore, we propose the KL-Top loss function, designed to mitigate noise during optimization while retaining richer semantic information within the limited calibration data imposed by PTQ. OSTQuant outperforms existing work on various LLMs and benchmarks. In the W4-only setting, it retains 99.5\% of the floating-point accuracy. In the more challenging W4A4KV4 configuration, OSTQuant reduces the performance gap by 32\% on the LLaMA-3-8B model compared to state-of-the-art methods. Code: https://github.com/BrotherHappy/OSTQuant.
中文: 本文提出OSTQuant方法,通过正交变换和缩放变换优化量化空间中的数据分布,在大语言模型压缩中实现了卓越性能,同时保持高精度。
English: This paper introduces OSTQuant, a novel quantization method that optimizes data distribution across the quantization space using orthogonal and scaling transformations, achieving superior performance in compressing Large Language Models while maintaining high accuracy.
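
To make the phrase "equivalent transformation" concrete, the sketch below checks numerically that an orthogonal matrix combined with a diagonal scaling can be folded into activations and weights without changing a linear layer's output. It is an illustration under assumptions, not OSTQuant's code: the 2x2 rotation `Q`, the scales `s`, and the placement of the transforms (activations get `Q` and the inverse scale, weights get the scale and the transpose of `Q`) are hypothetical choices, and nothing here is learned.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Orthogonal transformation: a 2-D rotation (assumed example, Q Q^T = I).
theta = 0.7
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

# Scaling transformation: a diagonal matrix and its inverse (assumed values).
s = [2.0, 0.5]
S = [[s[0], 0.0], [0.0, s[1]]]
S_inv = [[1.0 / s[0], 0.0], [0.0, 1.0 / s[1]]]

X = [[10.0, 0.1], [-8.0, 0.2]]   # activations with one dominant channel
W = [[0.3, -1.2], [0.8, 0.5]]    # weights of a linear layer computing X @ W

# Transformed pair: X' = X Q S^{-1}, W' = S Q^T W, so X' W' = X W.
X_t = matmul(matmul(X, Q), S_inv)
W_t = matmul(matmul(S, transpose(Q)), W)

Y_ref = matmul(X, W)
Y_t = matmul(X_t, W_t)
max_err = max(abs(a - b) for ra, rb in zip(Y_ref, Y_t) for a, b in zip(ra, rb))
print(max_err)
```

Because the product is unchanged in full precision, the transformation is free to reshape the distributions of `X_t` and `W_t` so that they quantize better; how those transforms are optimized is the subject of the paper and is not modeled here.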

Authors:Chengyi Cai, Zesheng Ye, Lei Feng, Jianzhong Qi, Feng Liu
Title: Attribute-based Visual Reprogramming for Vision-Language Models
Abstract:
Visual reprogramming (VR) reuses pre-trained vision models for downstream image classification tasks by adding trainable noise patterns to inputs. When applied to vision-language models (e.g., CLIP), existing VR approaches follow the same pipeline used in vision models (e.g., ResNet, ViT), where ground-truth class labels are inserted into fixed text templates to guide the optimization of VR patterns. This label-based approach, however, overlooks the rich information and diverse attribute-guided textual representations that CLIP can exploit, which may lead to the misclassification of samples. In this paper, we propose Attribute-based Visual Reprogramming (AttrVR) for CLIP, utilizing descriptive attributes (DesAttrs) and distinctive attributes (DistAttrs), which respectively represent common and unique feature descriptions for different classes. Besides, as images of the same class may reflect different attributes after VR, AttrVR iteratively refines patterns using the $k$-nearest DesAttrs and DistAttrs for each image sample, enabling more dynamic and sample-specific optimization. Theoretically, AttrVR is shown to reduce intra-class variance and increase inter-class separation. Empirically, it achieves superior performance in 12 downstream tasks for both ViT-based and ResNet-based CLIP. The success of AttrVR facilitates more effective integration of VR from unimodal vision models into vision-language models. Our code is available at https://github.com/tmlr-group/AttrVR.
Chinese Summary: 本文提出基于属性的视觉重编程方法(AttrVR),通过利用描述性和区分性属性动态优化输入模式,有效降低类内差异并提升CLIP模型在12个下游任务中的分类性能。
English Summary: This paper introduces Attribute-based Visual Reprogramming (AttrVR) for CLIP, which leverages descriptive and distinctive attributes to dynamically optimize input patterns, reducing intra-class variance and improving classification performance across 12 downstream tasks.

Authors:Yicheng Tao, Haotian Liu, Shanwen Wang, Hongteng Xu
Title: Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization
Abstract:
Formalized mathematics has recently garnered significant attention for its ability to assist mathematicians across various fields. Premise retrieval, as a common step in mathematical formalization, has been a challenge, particularly for inexperienced users. Existing retrieval methods that facilitate natural language queries require a certain level of mathematical expertise from users, while approaches based on formal languages (e.g., Lean) typically struggle with the scarcity of training data, hindering the training of effective and generalizable retrieval models. In this work, we introduce a novel method that leverages data extracted from Mathlib to train a lightweight and effective premise retrieval model. In particular, the proposed model embeds queries (i.e., proof state provided by Lean) and premises in a latent space, featuring a tokenizer specifically trained on formal corpora. The model is learned in a contrastive learning framework, in which a fine-grained similarity calculation method and a re-ranking module are applied to enhance the retrieval performance. Experimental results demonstrate that our model outperforms existing baselines, achieving higher accuracy while maintaining a lower computational load. We have released an open-source search engine based on our retrieval model at https://premise-search.com/. The source code and the trained model can be found at https://github.com/ruc-ai4math/Premise-Retrieval.
Chinese Summary: 本研究提出了一种新颖的形式化数学前提检索模型,利用Mathlib数据和对比学习框架,在降低计算负载的同时实现了比现有方法更高的检索准确率。
English Summary: This study introduces a novel premise retrieval model for formalized mathematics that uses data from Mathlib and a contrastive learning framework to achieve higher accuracy with lower computational costs compared to existing methods.

Authors:Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Hao Chen, Yilin Xiao, Chuang Zhou, Junnan Dong, Yi Chang, Xiao Huang
Title: A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise. Retrieval-Augmented Generation (RAG) has emerged as a promising solution to customize LLMs for professional fields by seamlessly integrating external knowledge bases, enabling real-time access to domain-specific expertise during inference. Despite its potential, traditional RAG systems, based on flat text retrieval, face three critical challenges: (i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale. This survey presents a systematic analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new paradigm that revolutionizes domain-specific LLM applications. GraphRAG addresses traditional RAG limitations through three key innovations: (i) graph-structured knowledge representation that explicitly captures entity relationships and domain hierarchies, (ii) efficient graph-based retrieval techniques that enable context-preserving knowledge retrieval with multihop reasoning ability, and (iii) structure-aware knowledge integration algorithms that leverage retrieved knowledge for accurate and logically coherent generation by LLMs. In this survey, we systematically analyze the technical foundations of GraphRAG and examine current implementations across various professional domains, identifying key technical challenges and promising research directions. All the related resources of GraphRAG, including research papers, open-source data, and projects, are collected for the community at https://github.com/DEEP-PolyU/Awesome-GraphRAG.
Chinese: GraphRAG通过图结构知识表示与检索技术,克服了传统RAG系统的局限,显著提升了专业领域大语言模型应用的推理能力和知识整合效果。
English: GraphRAG overcomes traditional RAG limitations by using graph-structured knowledge representation and retrieval techniques to enhance domain-specific LLM applications with improved reasoning and integration capabilities.

Authors:Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai
Title: Redundancy Principles for MLLMs Benchmarks
Abstract:
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of the performance of hundreds of MLLMs across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively. The code is available at https://github.com/zzc-1998/Benchmark-Redundancy.
中文: 本文针对多模态大语言模型基准中日益严重的冗余问题,从能力维度、测试题量和跨基准重叠三个关键角度进行分析,旨在提出针对性原则以优化评估体系。
English: This paper addresses the growing redundancy in Multi-modality Large Language Model benchmarks by analyzing three key aspects—capability dimensions, test question volume, and cross-benchmark overlap—to propose targeted principles for more effective evaluation.

Authors:Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, Hongsheng Li, Pheng-Ann Heng
Title: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Abstract:
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct unsatisfactory generated images, making it the first to incorporate reflection in autoregressive image generation. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文摘要:本研究首次将思维链推理应用于自回归图像生成,通过结合验证扩展、偏好对齐及新型奖励模型(PARM/PARM++),显著提升了生成性能,在基准测试中实现了24%的突破性改进。
English Summary: This study pioneers the application of Chain-of-Thought reasoning to autoregressive image generation, demonstrating that combining verification scaling, preference alignment, and novel reward models (PARM/PARM++) significantly enhances performance, achieving a 24% improvement on benchmarks.

Authors:Hao Dong, Eleni Chatzi, Olga Fink
Title: Towards Robust Multimodal Open-set Test-time Adaptation via Adaptive Entropy-aware Optimization
Abstract:
Test-time adaptation (TTA) has demonstrated significant potential in addressing distribution shifts between training and testing data. Open-set test-time adaptation (OSTTA) aims to adapt a source pre-trained model online to an unlabeled target domain that contains unknown classes. This task becomes more challenging when multiple modalities are involved. Existing methods have primarily focused on unimodal OSTTA, often filtering out low-confidence samples without addressing the complexities of multimodal data. In this work, we present Adaptive Entropy-aware Optimization (AEO), a novel framework specifically designed to tackle Multimodal Open-set Test-time Adaptation (MM-OSTTA) for the first time. Our analysis shows that the entropy difference between known and unknown samples in the target domain strongly correlates with MM-OSTTA performance. To leverage this, we propose two key components: Unknown-aware Adaptive Entropy Optimization (UAE) and Adaptive Modality Prediction Discrepancy Optimization (AMP). These components enhance the model's ability to distinguish unknown class samples during online adaptation by amplifying the entropy difference between known and unknown samples. To thoroughly evaluate our proposed methods in the MM-OSTTA setting, we establish a new benchmark derived from existing datasets. This benchmark includes two downstream tasks and incorporates five modalities. Extensive experiments across various domain shift situations demonstrate the efficacy and versatility of the AEO framework. Additionally, we highlight the strong performance of AEO in long-term and continual MM-OSTTA settings, both of which are challenging and highly relevant to real-world applications. Our source code is available at https://github.com/donghao51/AEO.
中文: AEO框架通过自适应熵优化增强多模态开放集测试时适应能力,利用已知与未知样本间的熵差异提升模型对未知类别的识别性能。
English: The AEO framework introduces adaptive entropy optimization to enhance multimodal open-set test-time adaptation by distinguishing unknown classes through amplified entropy differences between known and unknown samples.
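
The quantity at the center of this abstract, the prediction entropy of known versus unknown samples, can be computed as follows. This is only a sketch of the entropy gap itself, not of the UAE or AMP objectives, whose exact form the abstract does not give; the logits are made-up values standing in for a confident (known-like) and a flat (unknown-like) prediction.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Assumed example logits over four known classes.
confident = softmax([8.0, 0.5, 0.2, 0.1])   # peaked prediction, typical of a known class
uncertain = softmax([1.0, 1.0, 1.0, 1.0])   # flat prediction, typical of an unknown class

h_conf = entropy(confident)
h_unc = entropy(uncertain)
gap = h_unc - h_conf
print(h_conf, h_unc, gap)
```

A uniform prediction over four classes reaches the maximum entropy log(4), so a larger gap between the two groups makes a simple entropy threshold more reliable for flagging unknown-class samples.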

Authors:Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
Abstract:
With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
中文:随着FLUX.1和Ideogram2.0等扩散模型的突破,文本到图像模型在多领域展现出通用化潜力,但现有评估体系尚不完善,为此开发的IMAGINE-E基准通过五大关键维度对六款主流模型进行了系统评估,揭示了其作为基础AI工具的发展前景。
English: Recent advances in diffusion-based text-to-image models like FLUX.1 and Ideogram2.0 demonstrate expanding capabilities across multiple domains, though current evaluation frameworks remain inadequate for comprehensive assessment, prompting the development of the IMAGINE-E benchmark to systematically evaluate six leading models across five key domains.

Authors:Peiyuan Zhang, Junwei Luo, Xue Yang, Yi Yu, Qingyun Li, Yue Zhou, Xiaosong Jia, Xudong Lu, Jingdong Chen, Xiang Li, Junchi Yan, Yansheng Li
Title: PointOBB-v3: Expanding Performance Boundaries of Single Point-Supervised Oriented Object Detection
Abstract:
With the growing demand for oriented object detection (OOD), recent studies on point-supervised OOD have attracted significant interest. In this paper, we propose PointOBB-v3, a stronger single point-supervised OOD framework. Compared to existing methods, it generates pseudo rotated boxes without additional priors and incorporates support for the end-to-end paradigm. PointOBB-v3 functions by integrating three unique image views: the original view, a resized view, and a rotated/flipped (rot/flp) view. Based on the views, a scale augmentation module and an angle acquisition module are constructed. In the first module, a Scale-Sensitive Consistency (SSC) loss and a Scale-Sensitive Feature Fusion (SSFF) module are introduced to improve the model's ability to estimate object scale. To achieve precise angle predictions, the second module employs symmetry-based self-supervised learning. Additionally, we introduce an end-to-end version that eliminates the pseudo-label generation process by integrating a detector branch and introduces an Instance-Aware Weighting (IAW) strategy to focus on high-quality predictions. We conducted extensive experiments on the DIOR-R, DOTA-v1.0/v1.5/v2.0, FAIR1M, STAR, and RSAR datasets. Across all these datasets, our method achieves an average improvement in accuracy of 3.56% in comparison to previous state-of-the-art methods. The code will be available at https://github.com/ZpyWHU/PointOBB-v3.
中文: PointOBB-v3是一种更强大的单点监督定向目标检测框架,无需额外先验即可生成伪旋转框,通过多视图集成和尺度角度增强模块,在多个数据集上平均精度提升3.56%。
English: PointOBB-v3 is an advanced single point-supervised oriented object detection framework that generates pseudo rotated boxes without extra priors, integrates multi-view processing with scale and angle enhancement modules, and achieves a 3.56% average accuracy improvement across multiple datasets.

Authors:Shiling Deng, Serge Belongie, Peter Ebert Christensen
Title: Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes
Abstract:
Memes have emerged as a powerful form of communication, integrating visual and textual elements to convey humor, satire, and cultural messages. Existing research has focused primarily on aspects such as emotion classification, meme generation, propagation, interpretation, figurative language, and sociolinguistics, but has often overlooked deeper meme comprehension and meme-text retrieval. To address these gaps, this study introduces ClassicMemes-50-templates (CM50), a large-scale dataset consisting of over 33,000 memes, centered around 50 popular meme templates. We also present an automated knowledge-grounded annotation pipeline leveraging large vision-language models to produce high-quality image captions, meme captions, and literary device labels, overcoming the labor-intensive demands of manual annotation. Additionally, we propose a meme-text retrieval CLIP model (mtrCLIP) that utilizes cross-modal embedding to enhance meme analysis, significantly improving retrieval performance. Our contributions include: (1) a novel dataset for large-scale meme study, (2) a scalable meme annotation framework, and (3) a fine-tuned CLIP for meme-text retrieval, all aimed at advancing the understanding and analysis of memes at scale.
中文: 本研究通过推出大规模CM50表情包数据集、自动化标注流程及优化的CLIP检索模型,解决了深度表情包理解的研究空白,推动了规模化表情包分析的发展。
English: This study introduces CM50, a large-scale meme dataset with automated annotations and a fine-tuned CLIP model for meme-text retrieval, addressing gaps in deep meme comprehension and advancing scalable meme analysis.

Authors:Frederik Pahde, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Title: Ensuring Medical AI Safety: Interpretability-Driven Detection and Mitigation of Spurious Model Behavior and Associated Data
Abstract:
Deep neural networks are increasingly employed in high-stakes medical applications, despite their tendency for shortcut learning in the presence of spurious correlations, which can have potentially fatal consequences in practice. Whereas a multitude of works address either the detection or mitigation of such shortcut behavior in isolation, the Reveal2Revise approach provides a comprehensive bias mitigation framework combining these steps. However, effectively addressing these biases often requires substantial labeling efforts from domain experts. In this work, we review the steps of the Reveal2Revise framework and enhance it with semi-automated interpretability-based bias annotation capabilities. This includes methods for the sample- and feature-level bias annotation, providing valuable information for bias mitigation methods to unlearn the undesired shortcut behavior. We show the applicability of the framework using four medical datasets across two modalities, featuring controlled and real-world spurious correlations caused by data artifacts. We successfully identify and mitigate these biases in VGG16, ResNet50, and contemporary Vision Transformer models, ultimately increasing their robustness and applicability for real-world medical tasks. Our code is available at https://github.com/frederikpahde/medical-ai-safety.
中文:Reveal2Revise框架通过半自动化的基于可解释性的偏差标注功能得到增强,能有效识别和缓解医学AI模型中的捷径学习问题,从而提升了模型在多种数据集和架构中的鲁棒性。
English: The Reveal2Revise framework is enhanced with semi-automated interpretability-based bias annotation to effectively identify and mitigate shortcut learning in medical AI models, improving their robustness across various datasets and architectures.

Authors:Yizhe Lv, Tingting Zhang, Zhijian Wang, Yunpeng Song, Han Ding, Jinsong Han, Fei Wang
Title: mmEgoHand: Egocentric Hand Pose Estimation and Gesture Recognition with Head-mounted Millimeter-wave Radar and IMU
Abstract:
Recent advancements in millimeter-wave (mmWave) radar have demonstrated its potential for human action recognition and pose estimation, offering privacy-preserving advantages over conventional cameras while maintaining occlusion robustness, with promising applications in human-computer interaction and wellness care. However, existing mmWave systems typically employ fixed-position configurations, restricting user mobility to predefined zones and limiting practical deployment scenarios. We introduce mmEgoHand, a head-mounted egocentric system for hand pose estimation to support applications such as gesture recognition, VR interaction, skill digitization and assessment, and robotic teleoperation. mmEgoHand synergistically integrates mmWave radar with inertial measurement units (IMUs) to enable dynamic perception. The IMUs actively compensate for radar interference induced by head movements, while our novel end-to-end Transformer architecture simultaneously estimates 3D hand keypoint coordinates through multi-modal sensor fusion. This dual-modality framework achieves spatial-temporal alignment of mmWave heatmaps with IMUs, overcoming viewpoint instability inherent in egocentric sensing scenarios. We further demonstrate that intermediate hand pose representations substantially improve performance in downstream tasks, e.g., VR gesture recognition. Extensive evaluations with 10 subjects performing 8 gestures across 3 distinct postures -- standing, sitting, lying -- achieve 90.8% recognition accuracy, outperforming state-of-the-art solutions by a large margin. Dataset and code are available at https://github.com/WhisperYi/mmVR.
中文:mmEgoHand是一种头戴式系统,结合毫米波雷达和惯性测量单元实现动态手部姿态估计,通过新型Transformer架构和多模态融合,在不同姿势下实现了高精度的手势识别。
English: mmEgoHand is a head-mounted system combining mmWave radar and IMUs for dynamic hand pose estimation, achieving high accuracy in gesture recognition across various postures through a novel Transformer architecture and multi-modal fusion.

Authors:Zhi Sheng, Daisy Yuan, Jingtao Ding, Yong Li
Title: Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction
Abstract:
Accurate prediction of mobile traffic, i.e., network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel in capturing such complex temporal dynamics due to their ability to capture the inherent uncertainties. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into prior and residual components, with the prior derived from data dynamics, enhancing the model's ability to capture both regular and abrupt variations. NPDiff can seamlessly integrate with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement of over 30\%, offering a new perspective on leveraging diffusion models in this domain. We provide code and data at https://github.com/tsinghua-fib-lab/NPDiff.
中文摘要:本文提出NPDiff框架,通过将噪声分解为先验和残差分量来改进移动流量预测,能更有效地捕捉规律模式和突发变化,实验表明其性能提升超过30%。
English Summary: This paper introduces NPDiff, a novel diffusion-based framework that enhances mobile traffic prediction by decomposing noise into prior and residual components, achieving over 30% performance improvement through better modeling of both regular patterns and abrupt variations.

Authors:Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang
Title: Parameter-Efficient Fine-Tuning for Foundation Models
Abstract:
This survey delves into the realm of Parameter-Efficient Fine-Tuning (PEFT) within the context of Foundation Models (FMs). PEFT, a cost-effective fine-tuning technique, minimizes parameters and computational complexity while striving for optimal downstream task performance. FMs, like ChatGPT, DALL-E, and LLaVA, specialize in language understanding, generative tasks, and multimodal tasks, and are trained on diverse datasets spanning text, images, and videos. The diversity of FMs guides various adaptation strategies for PEFT. Therefore, this survey aims to provide a comprehensive overview of PEFT techniques applied to diverse FMs and address critical gaps in understanding the techniques, trends, and applications. We start by providing a detailed development of FMs and PEFT. Subsequently, we systematically review the key categories and core mechanisms of PEFT across diverse FMs to offer a comprehensive understanding of trends. We also explore the most recent applications across various FMs to demonstrate the versatility of PEFT, shedding light on the integration of systematic PEFT methods with a range of FMs. Furthermore, we identify potential research and development directions for improving PEFTs in the future. This survey provides a valuable resource for both newcomers and experts seeking to understand and use the power of PEFT across FMs. All reviewed papers are listed at https://github.com/THUDM/Awesome-Parameter-Efficient-Fine-Tuning-for-Foundation-Models.
中文: 本综述全面探讨了基础模型的参数高效微调技术,系统分析了其核心机制、应用场景与发展趋势,为相关研究者提供了重要参考。
English: This survey comprehensively reviews parameter-efficient fine-tuning techniques for foundation models, analyzing their mechanisms, applications, and future research directions to serve as a valuable resource for researchers.

Authors:Mingzhao Wang, You Zhou, Zhiguang Cao, Yubin Xiao, Xuan Wu, Wei Pang, Yuan Jiang, Hui Yang, Peng Zhao, Yuanshu Li
Title: An Efficient Diffusion-based Non-Autoregressive Solver for Traveling Salesman Problem
Abstract:
Recent advances in neural models have shown considerable promise in solving Traveling Salesman Problems (TSPs) without relying on much hand-crafted engineering. However, while non-autoregressive (NAR) approaches benefit from faster inference through parallelism, they typically deliver solutions of inferior quality compared to autoregressive ones. To enhance the solution quality while maintaining fast inference, we propose DEITSP, a diffusion model with efficient iterations tailored for TSP that operates in a NAR manner. Firstly, we introduce a one-step diffusion model that integrates the controlled discrete noise addition process with self-consistency enhancement, enabling optimal solution prediction through simultaneous denoising of multiple solutions. Secondly, we design a dual-modality graph transformer to bolster the extraction and fusion of features from node and edge modalities, while further accelerating the inference with fewer layers. Thirdly, we develop an efficient iterative strategy that alternates between adding and removing noise to improve exploration compared to previous diffusion methods. Additionally, we devise a scheduling framework to progressively refine the solution space by adjusting noise levels, facilitating a smooth search for optimal solutions. Extensive experiments on real-world and large-scale TSP instances demonstrate that DEITSP performs favorably against existing neural approaches in terms of solution quality, inference latency, and generalization ability. Our code is available at https://github.com/DEITSP/DEITSP.
中文: 提出的DEITSP模型通过非自回归扩散方法和高效迭代策略,在旅行商问题中提升了求解质量,并在质量、速度和泛化能力上优于现有方法。
English: The proposed DEITSP model enhances solution quality for Traveling Salesman Problems through a non-autoregressive diffusion approach with efficient iterations, outperforming existing methods in quality, speed, and generalization.

Authors:Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang
Title: UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
Abstract:
Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or likely suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly solved problems across all three versions, and reasoning gap ($Δ$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3\% by OpenAI-o1-mini, with large $Δ$ values observed across different models. This highlights the need for future research aimed at developing "large reasoning models" with high EAcc and $Δ= 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation code, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems. Code and data are available at https://github.com/YangLabHKUST/UGMathBench
中文: UGMathBench被提出作为一个全面的基准测试,旨在评估大型语言模型在本科数学推理上的能力,通过多样化问题和有效准确率、推理差距等新指标弥补现有不足,评估结果显示模型性能仍有很大提升空间。
English: UGMathBench is introduced as a comprehensive benchmark to evaluate LLMs' undergraduate-level mathematical reasoning, addressing gaps in existing benchmarks through diverse problems and new metrics like effective accuracy and reasoning gap, with evaluations revealing significant room for improvement in model performance.
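
The two metrics described in the abstract above can be illustrated with a minimal sketch. It assumes each problem's outcome is stored as one boolean per randomized version; the function names and the toy `results` data are hypothetical and are not taken from the UGMathBench code base.

```python
from typing import Sequence


def effective_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """EAcc: fraction of problems solved correctly in every randomized version."""
    return sum(all(versions) for versions in results) / len(results)


def average_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-version accuracies (assumes equal version counts per problem)."""
    per_version = [sum(col) / len(col) for col in zip(*results)]
    return sum(per_version) / len(per_version)


def reasoning_gap(results: Sequence[Sequence[bool]]) -> float:
    """Delta: average accuracy across versions minus EAcc."""
    return average_accuracy(results) - effective_accuracy(results)


# Toy data (hypothetical): 4 problems, 3 randomized versions each.
results = [
    [True, True, True],     # solved in all versions, counts toward EAcc
    [True, False, True],    # solved in some versions only
    [False, False, False],  # never solved
    [True, True, False],    # solved in some versions only
]

print(f"EAcc  = {effective_accuracy(results):.3f}")  # 0.250
print(f"Delta = {reasoning_gap(results):.3f}")       # 0.333
```

A model that answers consistently across versions has Delta close to 0; a large Delta means many problems are solved in some versions but not in others.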

Authors:Dario Serez, Marco Cristani, Alessio Del Bue, Vittorio Murino, Pietro Morerio
Title: A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation
Abstract:
In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images, from global characteristics to finer and local details (e.g., StyleGAN, NVAE), emerging as powerful tools for diverse applications. Yet their generative dynamics remain only empirically observed, without a systematic understanding of each latent variable's impact. In this work, we propose a novel framework that quantifies the contribution of each latent variable using Mutual Information (MI) as a metric. Our analysis reveals that current MLVGMs often underutilize some latent variables, and provides actionable insights for their use in downstream applications. With this foundation, we introduce a method for generating synthetic data for Self-Supervised Contrastive Representation Learning (SSCRL). By leveraging the hierarchical and disentangled variables of MLVGMs, our approach produces diverse and semantically meaningful views without the need for real image data. Additionally, we introduce a Continuous Sampling (CS) strategy, where the generator dynamically creates new samples during SSCRL training, greatly increasing data variability. Our comprehensive experiments demonstrate the effectiveness of these contributions, showing that MLVGMs' generated views compete on par with or even surpass views generated from real data. This work establishes a principled approach to understanding and exploiting MLVGMs, advancing both generative modeling and self-supervised learning. Code and pre-trained models at: https://github.com/SerezD/mi_ml_gen.
中文摘要:本研究提出一种基于互信息的框架来量化多潜在变量生成模型中各变量的作用,发现其利用不足的问题,并开发了无需真实图像即可生成媲美真实数据的自监督学习合成数据方法。
English Summary: This study introduces a framework to quantify the impact of latent variables in MLVGMs using mutual information, revealing their underutilization and enabling synthetic data generation for self-supervised learning that rivals real data performance.

Authors:Abdulrahman Oladipupo Ibraheem
Title: Regularizing cross entropy loss via minimum entropy and K-L divergence
Abstract:
I introduce two novel loss functions for classification in deep learning. The two loss functions extend standard cross entropy loss by regularizing it with minimum entropy and Kullback-Leibler (K-L) divergence terms. The first of the two novel loss functions is termed mixed entropy loss (MIX-ENT for short), while the second one is termed minimum entropy regularized cross-entropy loss (MIN-ENT for short). The MIX-ENT function introduces a regularizer that can be shown to be equivalent to the sum of a minimum entropy term and a K-L divergence term. However, it should be noted that the K-L divergence term here is different from that in the standard cross-entropy loss function, in the sense that it swaps the roles of the target probability and the hypothesis probability. The MIN-ENT function simply adds a minimum entropy regularizer to the standard cross entropy loss function. In both MIX-ENT and MIN-ENT, the minimum entropy regularizer minimizes the entropy of the hypothesis probability distribution which is output by the neural network. Experiments on the EMNIST-Letters dataset show that my implementation of MIX-ENT and MIN-ENT lets the VGG model climb from its previous 3rd position on the paperswithcode leaderboard to reach the 2nd position on the leaderboard, outperforming the Spinal-VGG model in so doing. Specifically, using standard cross-entropy, VGG achieves 95.86% while Spinal-VGG achieves 95.88% classification accuracies, whereas using VGG (without Spinal-VGG) our MIN-ENT achieved 95.933%, while our MIX-ENT achieved 95.927% accuracies. The pre-trained models for both MIX-ENT and MIN-ENT are at https://github.com/rahmanoladi/minimum entropy project.
Chinese: 本文提出了两种新的损失函数MIX-ENT和MIN-ENT,通过引入最小熵和K-L散度正则化改进标准交叉熵,在EMNIST-Letters数据集上提升了VGG模型的分类准确率,使其在排行榜上超越原有排名。
English: This paper introduces two novel loss functions, MIX-ENT and MIN-ENT, which enhance standard cross-entropy by incorporating minimum entropy and K-L divergence regularization, leading to improved classification accuracy on the EMNIST-Letters dataset and advancing VGG's leaderboard position.
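
The MIN-ENT idea described above (standard cross entropy plus a regularizer that minimizes the entropy of the network's output distribution) might look roughly like the NumPy sketch below. The weight `lam`, the `eps` smoothing constant, and all function names are assumptions made for illustration; the abstract does not state how the regularizer is weighted, and this is not the author's implementation.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(probs: np.ndarray, labels: np.ndarray, eps: float = 1e-12) -> float:
    """Mean negative log-probability assigned to the true class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + eps).mean())


def entropy(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Mean Shannon entropy of the predicted (hypothesis) distributions."""
    return float(-(probs * np.log(probs + eps)).sum(axis=-1).mean())


def min_ent_loss(logits: np.ndarray, labels: np.ndarray, lam: float = 0.1) -> float:
    """Cross entropy plus lam times the prediction entropy (lam is a hypothetical weight)."""
    probs = softmax(logits)
    return cross_entropy(probs, labels) + lam * entropy(probs)


# Toy batch (hypothetical): 2 samples, 3 classes.
logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
print(min_ent_loss(logits, labels, lam=0.0))  # plain cross entropy
print(min_ent_loss(logits, labels, lam=0.1))  # with the entropy penalty added
```

Because entropy is non-negative, the regularized loss is never below the plain cross entropy; the penalty pushes the network toward more confident (lower-entropy) predictions.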

Authors:Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang
Title: MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Abstract:
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a unified multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings, along with multimodal class tokens. A mask prior generator utilizes the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we introduce a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal consistency. Extensive experiments on several RVOS benchmarks demonstrate the superiority of MPG-SAM 2 and the effectiveness of our proposed modules. The code is available at https://github.com/rongfu-dsb/MPG-SAM2.
中文:提出的MPG-SAM 2框架通过多模态编码器实现语义对齐,并采用分层聚合器获取全局上下文,显著提升了离线参考视频对象分割的性能,在多个基准测试中表现优异。
English: The proposed MPG-SAM 2 framework enhances offline referring video object segmentation by integrating multimodal encoders for semantic alignment and a hierarchical aggregator for global context, achieving superior performance on benchmarks.

Authors:Olaya Pérez-Mon, Juan José del Coz, Pablo González
Title: Quantification via Gaussian Latent Space Representations
Abstract:
Quantification, or prevalence estimation, is the task of predicting the prevalence of each class within an unknown bag of examples. Most existing quantification methods in the literature rely on prior probability shift assumptions to create a quantification model that uses the predictions of an underlying classifier to make optimal prevalence estimates. In this work, we present an end-to-end neural network that uses Gaussian distributions in latent spaces to obtain invariant representations of bags of examples. This approach addresses the quantification problem using deep learning, enabling the optimization of specific loss functions relevant to the problem and avoiding the need for an intermediate classifier, tackling the quantification problem as a direct optimization problem. Our method achieves state-of-the-art results, both against traditional quantification methods and other deep learning approaches for quantification. The code needed to reproduce all our experiments is publicly available at https://github.com/AICGijon/gmnet.
Chinese: 本文提出了一种端到端的神经网络,利用潜在空间中的高斯分布直接优化量化任务,无需依赖中间分类器即可实现最先进的成果。
English: This paper introduces an end-to-end neural network that leverages Gaussian distributions in latent spaces to directly optimize quantification tasks, achieving state-of-the-art results without relying on intermediate classifiers.

Authors:Qiang Hu, Qihan He, Houqiang Zhong, Guo Lu, Xiaoyun Zhang, Guangtao Zhai, Yanfeng Wang
Title: VARFVV: View-Adaptive Real-Time Interactive Free-View Video Streaming with Edge Computing
Abstract:
Free-view video (FVV) allows users to explore immersive video content from multiple views. However, delivering FVV poses significant challenges due to the uncertainty in view switching, combined with the substantial bandwidth and computational resources required to transmit and decode multiple video streams, which may result in frequent playback interruptions. Existing approaches, either client-based or cloud-based, struggle to meet high Quality of Experience (QoE) requirements under limited bandwidth and computational resources. To address these issues, we propose VARFVV, a bandwidth- and computationally-efficient system that enables real-time interactive FVV streaming with high QoE and low switching delay. Specifically, VARFVV introduces a low-complexity FVV generation scheme that reassembles multiview video frames at the edge server based on user-selected view tracks, eliminating the need for transcoding and significantly reducing computational overhead. This design makes it well-suited for large-scale, mobile-based UHD FVV experiences. Furthermore, we present a popularity-adaptive bit allocation method, leveraging a graph neural network, that predicts view popularity and dynamically adjusts bit allocation to maximize QoE within bandwidth constraints. We also construct an FVV dataset comprising 330 videos from 10 scenes, including basketball, opera, etc. Extensive experiments show that VARFVV surpasses existing methods in video quality, switching latency, computational efficiency, and bandwidth usage, supporting over 500 users on a single edge server with a switching delay of 71.5ms. Our code and dataset are available at https://github.com/qianghu-huber/VARFVV.
中文摘要:VARFVV是一种带宽和计算效率高的系统,通过在边缘服务器重组多视角视频帧并采用基于流行度的自适应码率分配,实现了高质量体验和低切换延迟的实时交互式自由视角视频流传输。
English Summary: VARFVV is a bandwidth- and computationally-efficient system that enables real-time interactive free-view video streaming with high quality of experience and low switching delay by reassembling multiview frames at the edge server and using popularity-adaptive bit allocation.

Authors:Younes Yousef, Lukas Galke, Ansgar Scherp
Title: A Transformer-based Autoregressive Decoder Architecture for Hierarchical Text Classification
Abstract:
Recent approaches in hierarchical text classification (HTC) rely on the capabilities of a pre-trained transformer model and exploit the label semantics and a graph encoder for the label hierarchy. In this paper, we introduce an effective hierarchical text classifier RADAr (Transformer-based Autoregressive Decoder Architecture) that is based only on an off-the-shelf RoBERTa transformer to process the input and a custom autoregressive decoder with two decoder layers for generating the classification output. Thus, unlike existing approaches for HTC, the encoder of RADAr has no explicit encoding of the label hierarchy and the decoder solely relies on the label sequences of the samples observed during training. We demonstrate on three benchmark datasets that RADAr achieves results competitive with the state of the art with less training and inference time. Our model consistently performs better when organizing the label sequences from children to parents versus the inverse, as done in existing HTC approaches. Our experiments show that neither the label semantics nor an explicit graph encoder for the hierarchy is needed. This has strong practical implications for HTC as the architecture has fewer requirements and provides a speed-up by a factor of 2 at inference time. Moreover, training a separate decoder from scratch in conjunction with fine-tuning the encoder allows future researchers and practitioners to exchange the encoder part as new models arise. The source code is available at https://github.com/yousef-younes/RADAr.
Chinese: RADAr模型提出了一种基于Transformer的层次文本分类器,采用标准RoBERTa编码器和自定义自回归解码器,无需显式标签层次编码即可达到先进水平,同时将推理时间减半。
English: The RADAr model introduces a transformer-based hierarchical text classifier that uses a standard RoBERTa encoder and a custom autoregressive decoder, achieving competitive results without explicit label hierarchy encoding while reducing inference time by half.

Authors:Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
Abstract:
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original model architectures. This limits their applicability across different domains and diverse diffusion model configurations. In this paper, we first observe the inherent capability of language models, coined context consistency, to comprehend identity through context with a single prompt. Drawing inspiration from the inherent context consistency, we propose a novel training-free method for consistent text-to-image (T2I) generation, termed "One-Prompt-One-Story" (1Prompt1Story). Our approach, 1Prompt1Story, concatenates all prompts into a single input for T2I diffusion models, initially preserving character identities. We then refine the generation process using two novel techniques: Singular-Value Reweighting and Identity-Preserving Cross-Attention, ensuring better alignment with the input description for each frame. In our experiments, we compare our method against various existing consistent T2I generation approaches to demonstrate its effectiveness through quantitative metrics and qualitative assessments. Code is available at https://github.com/byliutao/1Prompt1Story.
Chinese Summary: 本文提出“1Prompt1Story”方法,通过合并提示词并采用奇异值重加权和身份保持交叉注意力技术,无需训练即可实现一致性文本到图像生成,在实验中优于现有方法。
English Summary: The paper introduces "1Prompt1Story," a training-free method that enhances consistent text-to-image generation by combining prompts and refining outputs with Singular-Value Reweighting and Identity-Preserving Cross-Attention, outperforming existing approaches.

Authors:Chenxu Wu, Qingpeng Kong, Zihang Jiang, S. Kevin Zhou
Title: Self-Supervised Diffusion MRI Denoising via Iterative and Stable Refinement
Abstract:
Magnetic Resonance Imaging (MRI), including diffusion MRI (dMRI), serves as a "microscope" for anatomical structures and routinely mitigates the influence of low signal-to-noise ratio scans by compromising temporal or spatial resolution. However, these compromises fail to meet clinical demands for both efficiency and precision. Consequently, denoising is a vital preprocessing step, particularly for dMRI, where clean data is unavailable. In this paper, we introduce Di-Fusion, a fully self-supervised denoising method that leverages the latter diffusion steps and an adaptive sampling process. Unlike previous approaches, our single-stage framework achieves efficient and stable training without extra noise model training and offers adaptive and controllable results in the sampling process. Our thorough experiments on real and simulated data demonstrate that Di-Fusion achieves state-of-the-art performance in microstructure modeling, tractography tracking, and other downstream tasks. Code is available at https://github.com/FouierL/Di-Fusion.
中文: Di-Fusion是一种无需额外噪声训练的自监督去噪方法,通过自适应采样提升MRI效率与精度,在微观结构建模和纤维束追踪中表现卓越。
English: Di-Fusion is a self-supervised denoising method that improves MRI efficiency and precision without extra noise training, achieving top performance in microstructure modeling and tractography.

Authors:Xuerui Qiu, Malu Zhang, Jieyuan Zhang, Wenjie Wei, Honglin Cao, Junsheng Guo, Rui-Jie Zhu, Yimeng Shan, Yang Yang, Haizhou Li
Title: Quantized Spike-driven Transformer
Abstract:
Spiking neural networks are emerging as a promising energy-efficient alternative to traditional artificial neural networks due to their spike-driven paradigm. However, recent research in the SNN domain has mainly focused on enhancing accuracy by designing large-scale Transformer structures, which typically rely on substantial computational resources, limiting their deployment on resource-constrained devices. To overcome this challenge, we propose a quantized spike-driven Transformer baseline (QSD-Transformer), which achieves reduced resource demands by utilizing a low bit-width parameter. Regrettably, the QSD-Transformer often suffers from severe performance degradation. In this paper, we first conduct empirical analysis and find that the bimodal distribution of quantized spike-driven self-attention (Q-SDSA) leads to spike information distortion (SID) during quantization, causing significant performance degradation. To mitigate this issue, we take inspiration from mutual information entropy and propose a bi-level optimization strategy to rectify the information distribution in Q-SDSA. Specifically, at the lower level, we introduce an information-enhanced LIF to rectify the information distribution in Q-SDSA. At the upper level, we propose a fine-grained distillation scheme for the QSD-Transformer to align the distribution in Q-SDSA with that in the counterpart ANN. By integrating the bi-level optimization strategy, the QSD-Transformer can attain enhanced energy efficiency without sacrificing its high-performance advantage. For instance, when compared to the prior SNN benchmark on ImageNet, the QSD-Transformer achieves 80.3% top-1 accuracy, accompanied by significant reductions of 6.0$\times$ and 8.1$\times$ in power consumption and model size, respectively. Code is available at https://github.com/bollossom/QSD-Transformer.
中文: 本文提出量化脉冲驱动Transformer(QSD-Transformer),通过双级优化策略解决脉冲信息失真导致的性能下降问题,在ImageNet上实现了高精度与显著提升的能效。
English: This paper proposes a quantized spike-driven Transformer (QSD-Transformer) that addresses performance degradation caused by spike information distortion through a bi-level optimization strategy, achieving enhanced energy efficiency and high accuracy on ImageNet.

Authors:Yuliang Gu, Weilun Tsao, Bo Du, Thierry Géraud, Yongchao Xu
Title: Leveraging Textual Anatomical Knowledge for Class-Imbalanced Semi-Supervised Multi-Organ Segmentation
Abstract:
Annotating 3D medical images demands substantial time and expertise, driving the adoption of semi-supervised learning (SSL) for segmentation tasks. However, the complex anatomical structures of organs often lead to significant class imbalances, posing major challenges for deploying SSL in real-world scenarios. Despite the availability of valuable prior information, such as inter-organ relative positions and organ shape priors, existing SSL methods have yet to fully leverage these insights. To address this gap, we propose a novel approach that integrates textual anatomical knowledge (TAK) into the segmentation model. Specifically, we use GPT-4o to generate textual descriptions of anatomical priors, which are then encoded using a CLIP-based model. These encoded priors are injected into the segmentation model as parameters of the segmentation head. Additionally, contrastive learning is employed to enhance the alignment between textual priors and visual features. Extensive experiments demonstrate the superior performance of our method, significantly surpassing state-of-the-art approaches. The source code will be available at: https://github.com/Lunn88/TAK-Semi.
中文摘要:本研究提出了一种新颖的半监督学习方法,通过CLIP编码和对比学习将GPT-4o生成的文本解剖知识融入医学图像分割,显著超越了现有方法的性能表现。
English Summary: This study introduces a novel semi-supervised learning method that integrates GPT-4o-generated textual anatomical knowledge into medical image segmentation through CLIP encoding and contrastive learning, significantly outperforming existing approaches.

Authors:Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, Huchuan Lu
Title: Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
Abstract:
Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding. Code is available at StreamChat: https://github.com/hmxiong/StreamChat.
中文摘要:StreamChat是一种无需训练的框架,通过分层记忆系统和并行调度技术,实现了流媒体视频的实时多轮对话理解,在准确性和响应速度上显著优于现有先进模型。
English Summary: StreamChat is a training-free framework that uses a hierarchical memory system and parallel scheduling to enable real-time, multi-turn dialogue for streaming video understanding, outperforming existing models in accuracy and speed.

Authors:Andong Li, Zhihang Sun, Fengyuan Hao, Xiaodong Li, Chengshi Zheng
Title: Neural Vocoders as Speech Enhancers
Abstract:
Speech enhancement (SE) and neural vocoding are traditionally viewed as separate tasks. In this work, we observe them under a common thread: the rank behavior of these processes. This observation prompts two key questions: \textit{Can a model designed for one task's rank degradation be adapted for the other?} and \textit{Is it possible to address both tasks using a unified model?} Our empirical findings demonstrate that existing speech enhancement models can be successfully trained to perform vocoding tasks, and a single model, when jointly trained, can effectively handle both tasks with performance comparable to separately trained models. These results suggest that speech enhancement and neural vocoding can be unified under a broader framework of speech restoration. Code: https://github.com/Andong-Li-speech/Neural-Vocoders-as-Speech-Enhancers.
中文: 本研究通过语音增强和神经声码器在秩行为上的共性,证明单一联合训练模型可同时胜任两项任务,且性能与专门模型相当。
English: This study demonstrates that speech enhancement and neural vocoding can be unified through their shared rank behavior, enabling a single jointly-trained model to perform both tasks comparably to specialized models.

Authors:Samer Attrah
Title: Emotion estimation from video footage with LSTM
Abstract:
Emotion estimation in general is a field that has been studied for a long time, and several approaches exist using machine learning. In this paper, we present an LSTM model that processes the blend-shapes produced by the MediaPipe library for a face detected in a live camera stream, in order to estimate the main emotion from the facial expressions. The model is trained on the FER2013 dataset and achieves 71% accuracy and a 62% F1-score, which meets the accuracy benchmark of the FER2013 dataset with significantly reduced computation costs. Code: https://github.com/Samir-atra/Emotion_estimation_from_video_footage_with_LSTM_ML_algorithm
中文: 本文提出一种LSTM模型,通过处理实时视频中MediaPipe生成的面部混合形状来估算情绪,在FER2013数据集上达到71%准确率和62% F1分数,且显著降低了计算成本。
English: This paper introduces an LSTM model that analyzes MediaPipe-generated facial blend-shapes from live video to estimate emotions, achieving 71% accuracy and 62% F1-score on the FER2013 dataset with lower computational costs.

Authors:Jian Wang, Xiaokang Zhang, Xianping Ma, Weikang Yu, Pedram Ghamisi
Title: Auto-Prompting SAM for Weakly Supervised Landslide Extraction
Abstract:
Weakly supervised landslide extraction aims to identify landslide regions from remote sensing data using models trained with weak labels, particularly image-level labels. However, it is often challenged by the imprecise boundaries of the extracted objects due to the lack of pixel-wise supervision and the properties of landslide objects. To tackle these issues, we propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM), i.e., APSAM. Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering. Specifically, it adaptively generates hybrid prompts from the CAMs obtained by an object localization network. To provide sufficient information for SAM prompting, an adaptive prompt generation (APG) algorithm is designed to fully leverage the visual patterns of CAMs, enabling the efficient generation of pseudo-masks for landslide extraction. These informative prompts are able to identify the extent of landslide areas (box prompts) and denote the centers of landslide objects (point prompts), guiding SAM in landslide segmentation. Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method, achieving improvements of at least 3.0\% in F1 score and 3.69\% in IoU compared to other state-of-the-art methods. The source code and datasets will be available at https://github.com/zxk688.
中文: 本研究提出APSAM方法,通过从类别激活图中自适应生成混合提示来引导Segment Anything模型,有效提升了弱监督滑坡提取在遥感数据上的分割精度。
English: This study introduces APSAM, a novel method that enhances weakly supervised landslide extraction by adaptively generating hybrid prompts from class activation maps to guide the Segment Anything Model, achieving significant improvements in segmentation accuracy on remote sensing datasets.
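
As a rough illustration of how box and point prompts might be derived from a class activation map, the sketch below thresholds a normalized CAM, takes the bounding box of the activated region as the box prompt, and uses the peak activation as the point prompt. This is a simplified stand-in and not the APG algorithm from the paper: the threshold value, the use of the peak as an object center, the single-region assumption, and the function name `prompts_from_cam` are all assumptions.

```python
import numpy as np


def prompts_from_cam(cam, threshold: float = 0.5):
    """Return ((x_min, y_min, x_max, y_max), (x, y)) from a 2-D CAM, or None.

    Simplification: all activated pixels are treated as a single region.
    """
    cam = np.asarray(cam, dtype=float)
    span = cam.max() - cam.min()
    # Min-max normalize to [0, 1]; a flat map carries no localization signal.
    norm = (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)

    ys, xs = np.nonzero(norm >= threshold)
    if ys.size == 0:
        return None

    # Box prompt: extent of the activated region.
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    # Point prompt: location of the strongest activation.
    py, px = np.unravel_index(np.argmax(norm), norm.shape)
    return box, (int(px), int(py))


# Toy CAM (hypothetical): one warm blob with a single hot pixel.
cam = np.zeros((6, 8))
cam[2:5, 3:6] = 0.6
cam[3, 4] = 1.0
print(prompts_from_cam(cam))  # ((3, 2, 5, 4), (4, 3))
```

In a real pipeline the resulting box and point would be passed to a promptable segmentation model; handling several disjoint landslide regions would require a connected-component step that is omitted here.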

Authors:Jinghan You, Shanglin Li, Yuanrui Sun, Jiangchuan Wei, Mingyu Guo, Chao Feng, Jiao Ran
Title: LVFace: Progressive Cluster Optimization for Large Vision Models in Face Recognition
Abstract:
Vision Transformers (ViTs) have revolutionized large-scale visual modeling, yet remain underexplored in face recognition (FR) where CNNs still dominate. We identify a critical bottleneck: CNN-inspired training paradigms fail to unlock ViT's potential, leading to suboptimal performance and convergence instability. To address this challenge, we propose LVFace, a ViT-based FR model that integrates Progressive Cluster Optimization (PCO) to achieve superior results. Specifically, PCO sequentially applies negative class sub-sampling (NCS) for robust and fast feature alignment from random initialization, feature expectation penalties for centroid stabilization, and cluster boundary refinement through full-batch training without NCS constraints. LVFace establishes a new state-of-the-art face recognition baseline, surpassing leading approaches such as UniFace and TopoFR across multiple benchmarks. Extensive experiments demonstrate that LVFace delivers consistent performance gains, while exhibiting scalability to large-scale datasets and compatibility with mainstream VLMs and LLMs. Notably, LVFace secured 1st place in the ICCV 2021 Masked Face Recognition (MFR)-Ongoing Challenge (March 2025), proving its efficacy in real-world scenarios. Project is available at https://github.com/bytedance/LVFace.
中文: LVFace是一种基于视觉变换器的人脸识别模型,通过渐进式聚类优化克服训练瓶颈,在多个基准测试中创下新纪录并赢得ICCV 2021 MFR挑战赛,展现了卓越性能和可扩展性。
English: LVFace, a Vision Transformer-based face recognition model with Progressive Cluster Optimization, overcomes training bottlenecks to set new benchmarks and win the ICCV 2021 MFR Challenge, demonstrating superior performance and scalability.

Authors:Yiming Tang, Abrar Anwar, Jesse Thomason
Title: M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention
Abstract:
Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Social signals include body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Past work in multi-party interaction tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking to simultaneously process multiple social cues across multiple participants and their temporal interactions. We train and evaluate M3PT on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/.
中文摘要:本研究提出M3PT模型,通过处理多方交互中的多模态社交线索及其时序关联,统一预测社交信号,在HHCD数据集上实现了进食时机和说话状态预测性能的提升。
English Summary: This research introduces M3PT, a unified transformer model that predicts multimodal social signals in multi-party interactions by processing multiple cues across participants and time, showing improved performance in bite timing and speaking status prediction on the HHCD dataset.
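
To make the notion of blockwise attention masking in a causal transformer more concrete, here is a generic block-causal mask in NumPy. It assumes tokens are ordered by time step and that each step contributes a fixed-size block of tokens (for example, one token per person and modality); a token may attend to every token in its own block and in earlier blocks. The abstract does not specify M3PT's exact mask layout, so this layout and the name `blockwise_causal_mask` are assumptions.

```python
import numpy as np


def blockwise_causal_mask(num_steps: int, tokens_per_step: int) -> np.ndarray:
    """Boolean mask of shape (N, N) with N = num_steps * tokens_per_step.

    mask[i, j] is True when query token i is allowed to attend to key token j,
    i.e. when j belongs to the same time-step block as i or to an earlier one.
    """
    step_of_token = np.repeat(np.arange(num_steps), tokens_per_step)
    return step_of_token[:, None] >= step_of_token[None, :]


# Toy setting (hypothetical): 3 time steps, 2 tokens per step.
mask = blockwise_causal_mask(num_steps=3, tokens_per_step=2)
print(mask.astype(int))
```

Unlike a plain lower-triangular causal mask, tokens within the same time step can see each other here, which lets cues from different participants or modalities at the same moment interact while future steps stay hidden.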

Authors:Zhaoxuan Tan, Zinan Zeng, Qingkai Zeng, Zhenyu Wu, Zheyuan Liu, Fengran Mo, Meng Jiang
Title: Can Large Language Models Understand Preferences in Personalized Recommendation?
Abstract:
Large Language Models (LLMs) excel in various tasks, including personalized recommendations. Existing evaluation methods often focus on rating prediction, relying on regression errors between actual and predicted ratings. However, user rating bias and item quality, two influential factors behind rating scores, can obscure personal preferences in user-item pair data. To address this, we introduce PerRecBench, disassociating the evaluation from these two factors and assessing recommendation techniques on capturing the personal preferences in a grouped ranking manner. We find that the LLM-based recommendation techniques that are generally good at rating prediction fail to identify users' favored and disfavored items when the user rating bias and item quality are eliminated by grouping users. With PerRecBench and 19 LLMs, we find that while larger models generally outperform smaller ones, they still struggle with personalized recommendation. Our findings reveal the superiority of pairwise and listwise ranking approaches over pointwise ranking, PerRecBench's low correlation with traditional regression metrics, the importance of user profiles, and the role of pretraining data distributions. We further explore three supervised fine-tuning strategies, finding that merging weights from single-format training is promising but improving LLMs' understanding of user preferences remains an open research problem. Code and data are available at https://github.com/TamSiuhin/PerRecBench
中文摘要:PerRecBench通过消除用户评分偏差和物品质量影响来评估基于大语言模型的推荐技术,发现尽管较大模型表现更优,但在个性化推荐上仍有困难,且排序方法优于逐点评分法。
English Summary: PerRecBench is introduced to evaluate LLM-based recommendation techniques by isolating user rating bias and item quality, revealing that while larger models perform better, they still struggle with personalized recommendations and that ranking approaches are superior to pointwise methods.

Authors:Zhiyuan Weng, Guikun Chen, Wenguan Wang
Title: Do as We Do, Not as You Think: the Conformity of Large Language Models
Abstract:
Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi-agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem-solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM-driven multi-agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity-oriented benchmark, featuring reasoning-intensive tasks and five distinct interaction protocols designed to probe LLMs' behavior in collaborative scenarios. Several representative LLMs are evaluated on BenchForm, using metrics such as conformity rate and independence rate to quantify conformity's impact. Our analysis delves into factors influencing conformity, including interaction time and majority size, and examines how the subject agent rationalizes its conforming behavior. Furthermore, we explore two strategies to mitigate conformity effects, i.e., developing enhanced personas and implementing a reflection mechanism. Several interesting findings regarding LLMs' conformity are derived from empirical results and case studies. We hope that these insights can pave the way for more robust and ethically-aligned collaborative AI systems. Our benchmark and code are available at BenchForm.
中文摘要:本研究通过开发BenchForm基准测试,系统探讨了大型语言模型驱动的多智能体系统中的从众现象,分析了其存在性、影响因素及通过增强角色设定和反思机制等缓解策略。
English Summary: This study investigates conformity in LLM-driven multi-agent systems, developing the BenchForm benchmark to analyze its existence, influencing factors, and mitigation strategies like enhanced personas and reflection mechanisms.

Authors:Gabrielle Hoyer, Michelle W Tong, Rupsa Bhattacharjee, Valentina Pedoia, Sharmila Majumdar
Title: Clinical Utility of Foundation Segmentation Models in Musculoskeletal MRI: Biomarker Fidelity and Predictive Outcomes
Abstract:
Effective segmentation is fundamental for quantitative medical imaging; however, foundation segmentation models remain insufficiently evaluated for accuracy and biomarker fidelity across the diverse anatomical contexts and imaging protocols encountered in musculoskeletal (MSK) MRI. We evaluate three widely used segmentation models (SAM, SAM2, MedSAM) across eleven MSK MRI datasets spanning the knee, hip, spine, shoulder, and thigh. Our framework assesses both zero-shot and finetuned performance, with attention to segmentation accuracy, generalizability across imaging protocols, and reliability of derived quantitative biomarkers. Finetuned models showed consistent agreement with expert measurements for biomarkers including cartilage thickness, disc height, muscle volume, and compositional T1rho/T2 values. Automated prompting through the AutoLabel system enabled scalable segmentation, with moderate trade-offs in accuracy. As proof of concept, we applied the validated system to (i) a three-stage knee MRI triage cascade and (ii) a longitudinal landmark model that predicts total knee replacement and incident osteoarthritis. The framework offers a transparent method for benchmarking segmentation tools and connecting model performance to clinical imaging priorities.
中文摘要:本研究在多种肌肉骨骼MRI数据集上评估三种分割模型,发现微调后的模型能获得可靠的生物标志物测量结果,并通过膝关节MRI分诊和骨关节炎预测系统验证了其临床应用价值。
English Summary: This study evaluates three segmentation models across diverse musculoskeletal MRI datasets, finding that fine-tuned models achieve reliable biomarker measurements and demonstrating their clinical applicability through knee MRI triage and osteoarthritis prediction systems.

Authors:Peirong Liu, Ana Lawry Aguila, Juan E. Iglesias
Title: Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization
Abstract:
Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA's effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images in the presence of pathology. Code is available at https://github.com/peirong26/UNA.
中文: UNA提出了一种与模态无关的正常脑解剖重建方法,通过流体驱动的异常随机化技术,无需微调即可直接应用于包含病理的真实图像,有效处理健康和病变扫描。
English: UNA introduces a modality-agnostic approach for reconstructing normal brain anatomy, capable of handling both healthy and pathological scans through fluid-driven anomaly randomization and direct application to real images without fine-tuning.

Authors:Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li
Title: ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild
Abstract:
The absence of publicly available, large-scale, high-quality datasets for Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has significantly hindered the application of rapidly advancing deep learning techniques, which hold huge potential to unlock new capabilities in this field. This is primarily because collecting large volumes of diverse target samples from SAR images is prohibitively expensive, largely due to privacy concerns, the characteristics of microwave radar imagery perception, and the need for specialized expertise in data annotation. Throughout the history of SAR ATR research, there have been only a handful of small datasets, mainly covering targets such as ships, airplanes, and buildings. The only vehicle dataset, MSTAR, was collected in the 1990s and has been a valuable source for SAR ATR. To fill this gap, this paper introduces a new large-scale dataset named ATRNet-STAR with 40 different vehicle categories collected under various realistic imaging conditions and scenes. It marks a substantial advancement in dataset scale and diversity, comprising over 190,000 well-annotated samples, 10 times larger than its predecessor, the famous MSTAR. Building such a large dataset is a challenging task, and the data collection scheme will be detailed. We then illustrate the value of ATRNet-STAR by extensively evaluating the performance of 15 representative methods with 7 different experimental settings on challenging classification and detection benchmarks derived from the dataset. Finally, based on our extensive experiments, we identify valuable insights for SAR ATR and discuss potential future research directions in this field. We hope that the scale, diversity, and benchmark of ATRNet-STAR can significantly facilitate the advancement of SAR ATR.
中文: 针对合成孔径雷达自动目标识别领域缺乏大规模高质量数据集的问题,本文推出了ATRNet-STAR数据集,包含40类车辆目标和19万余标注样本,通过全面基准测试显著推动了该领域的发展。
English: The lack of large-scale, high-quality datasets has limited the application of deep learning in Synthetic Aperture Radar Automatic Target Recognition (SAR ATR), prompting the introduction of ATRNet-STAR, a new dataset with 40 vehicle categories and over 190,000 annotated samples, which significantly advances the field through extensive benchmark evaluations.

Authors:Joshua Park, Yongfeng Zhang
Title: AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback
Abstract:
Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for the AgentRec recommendation system at https://github.com/joshprk/agentrec.
Chinese: 该研究提出了一种基于Sentence-BERT的新架构,通过自然语言提示推荐最适合任务的LLM代理,实现了92.2%的Top-1准确率,并通过强化学习提供了可解释性和适应性。
English: The study introduces a novel architecture using Sentence-BERT to recommend the most suitable LLM agent for tasks based on natural language prompts, achieving 92.2% top-1 accuracy efficiently and offering interpretability and adaptability through reinforcement learning.
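
The nearest-neighbor step described in the abstract can be sketched as follows. This is an illustration only: the embeddings here are toy vectors, whereas in the paper they would come from the fine-tuned SBERT encoder, and the function names `cosine` and `recommend_agent` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recommend_agent(
    prompt_emb: Sequence[float],
    agent_examples: Dict[str, List[Sequence[float]]],
) -> Optional[str]:
    """Return the agent owning the stored embedding most similar to the prompt."""
    best_agent, best_sim = None, -2.0  # cosine similarity is always >= -1
    for agent, embeddings in agent_examples.items():
        for emb in embeddings:
            sim = cosine(prompt_emb, emb)
            if sim > best_sim:
                best_agent, best_sim = agent, sim
    return best_agent
```

Because classification reduces to a lookup over stored embeddings, supporting a new agent in this sketch only requires adding its example embeddings to the dictionary.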

Authors:Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Title: RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Abstract:
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: https://github.com/TonyBY/RAMQA
中文摘要:RAMQA框架通过结合排序学习方法和生成式重排技术,弥合了传统排序模型与大型语言模型之间的差距,在多模态问答基准测试中实现了显著性能提升。
English Summary: The proposed RAMQA framework integrates learning-to-rank methods with generative permutation techniques to bridge the gap between traditional ranking models and modern LLMs, achieving significant performance improvements on multi-modal QA benchmarks.

Authors:Daeun Jung, Jaehyeok Jang, Sooyoung Jang, Yu Rang Park
Title: MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis
Abstract:
Computed tomography (CT) and clinical numeric data are essential modalities for cancer evaluation, but building large-scale multimodal training datasets for developing medical foundation models remains challenging due to the structural complexity of multi-slice CT data and the high cost of expert annotation. In this study, we propose MEDFORM, a multimodal pre-training strategy that guides CT image representation learning using complementary information from clinical data for medical foundation model development. MEDFORM efficiently processes CT slices through multiple instance learning (MIL) and adopts a dual pre-training strategy: first pretraining the CT slice feature extractor using SimCLR-based self-supervised learning, then aligning CT and clinical modalities through cross-modal contrastive learning. Our model was pre-trained on three different cancer types: lung cancer (141,171 slices), breast cancer (8,100 slices), and colorectal cancer (10,393 slices). The experimental results demonstrated that this dual pre-training strategy improves cancer classification performance and maintains robust performance in few-shot learning scenarios. Code available at https://github.com/DigitalHealthcareLab/25MultiModalFoundationModel.git
中文:MEDFORM提出了一种多模态预训练策略,通过自监督和跨模态对比学习将CT影像与临床数据结合,有效提升了癌症分类和小样本学习的能力。
English: MEDFORM introduces a multimodal pre-training approach that combines CT imaging with clinical data using self-supervised and cross-modal contrastive learning to enhance cancer classification and few-shot learning performance.

Authors:Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev
Title: SRMT: Shared Memory for Multi-agent Lifelong Pathfinding
Abstract:
Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.
Chinese: 共享循环记忆变换器(SRMT)通过汇集并全局广播各智能体的工作记忆,在多智能体路径规划任务中表现出色,尤其在稀疏奖励环境下优于基线方法,并在POGEMA基准测试中与先进算法性能相当。
English: The Shared Recurrent Memory Transformer (SRMT) enhances multi-agent coordination by pooling and broadcasting agents' working memories, achieving superior performance in navigation tasks and competitive results on POGEMA benchmarks compared to existing methods.

Authors:Yichen Wu, Hongming Piao, Long-Kai Huang, Renzhen Wang, Wanhua Li, Hanspeter Pfister, Deyu Meng, Kede Ma, Ying Wei
Title: SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning
Abstract:
Continual Learning (CL) with foundation models has recently emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks. However, existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks, which poses significant scalability challenges as the number of tasks grows. To address these limitations, we propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal. Our empirical and theoretical analysis reveals that SD-LoRA tends to follow a low-loss trajectory and converges to an overlapping low-loss region for all learned tasks, resulting in an excellent stability-plasticity trade-off. Building upon these insights, we introduce two variants of SD-LoRA with further improved parameter efficiency. All parameters of SD-LoRAs can be end-to-end optimized for CL objectives. Meanwhile, they support efficient inference by allowing direct evaluation with the finally trained model, obviating the need for component selection. Extensive experiments across multiple CL benchmarks and foundation models consistently validate the effectiveness of SD-LoRA. The code is available at https://github.com/WuYichen-97/SD-Lora-CL.
中文:SD-LoRA通过解耦LoRA组件的幅度和方向学习,解决了持续学习中的扩展性难题,无需样本回放即可实现高效的类增量学习,并在多个基准测试中展现出优越的性能。
English: SD-LoRA addresses scalability challenges in continual learning by decoupling the optimization of LoRA components' magnitude and direction, enabling efficient, rehearsal-free class incremental learning with strong empirical performance and theoretical backing.
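
One way to write a magnitude/direction decoupling of this kind is shown below. The notation is an assumption for illustration: $W_0$ is the frozen pre-trained weight, $B_k A_k$ is the low-rank update introduced for task $k$, $\alpha_k$ is a learnable scalar magnitude, and the Frobenius normalization is one possible choice of "direction". Which terms are frozen or trainable at each task is not stated in the abstract.

```latex
W_t \;=\; W_0 \;+\; \sum_{k=1}^{t} \alpha_k \,
\frac{B_k A_k}{\lVert B_k A_k \rVert_F},
\qquad \alpha_k \in \mathbb{R}
```

Under this form, each normalized term has unit Frobenius norm, so $\alpha_k$ alone controls how strongly the direction learned for task $k$ contributes to the adapted weight.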

Authors:Adam Tupper, Christian Gagné
Title: Revisiting Data Augmentation for Ultrasound Images
Abstract:
Data augmentation is a widely used and effective technique to improve the generalization performance of deep neural networks. Yet, despite often facing limited data availability when working with medical images, it is frequently underutilized. This appears to come from a gap in our collective understanding of the efficacy of different augmentation techniques across different tasks and modalities. One modality where this is especially true is ultrasound imaging. This work addresses this gap by analyzing the effectiveness of different augmentation techniques at improving model performance across a wide range of ultrasound image analysis tasks. To achieve this, we introduce a new standardized benchmark of 14 ultrasound image classification and semantic segmentation tasks from 10 different sources and covering 11 body regions. Our results demonstrate that many of the augmentations commonly used for tasks on natural images are also effective on ultrasound images, even more so than augmentations developed specifically for ultrasound images in some cases. We also show that diverse augmentation using TrivialAugment, which is widely used for natural images, is also effective for ultrasound images. Moreover, our proposed methodology represents a structured approach for assessing various data augmentations that can be applied to other contexts and modalities.
中文: 本研究评估了多种数据增强技术在超声图像分析中的效果,发现常用于自然图像的方法常优于超声专用增强技术,并提出了标准化基准以指导未来研究。
English: This study evaluates the effectiveness of various data augmentation techniques for ultrasound image analysis, revealing that methods commonly used for natural images often outperform ultrasound-specific augmentations, and introduces a standardized benchmark to guide future research.

Authors:Qiongyan Wang, Yutong Xia, Siru Zhong, Weichuang Li, Yuankai Wu, Shifen Cheng, Junbo Zhang, Yu Zheng, Yuxuan Liang
Title: AirRadar: Inferring Nationwide Air Quality in China with Deep Neural Networks
Abstract:
Monitoring real-time air quality is essential for safeguarding public health and fostering social progress. However, the widespread deployment of air quality monitoring stations is constrained by their significant costs. To address this limitation, we introduce AirRadar, a deep neural network designed to accurately infer real-time air quality in locations lacking monitoring stations by utilizing data from existing ones. By leveraging learnable mask tokens, AirRadar reconstructs air quality features in unmonitored regions. Specifically, it operates in two stages: first capturing spatial correlations and then adjusting for distribution shifts. We validate AirRadar's efficacy using a year-long dataset from 1,085 monitoring stations across China, demonstrating its superiority over multiple baselines, even with varying degrees of unobserved data. The source code can be accessed at https://github.com/CityMind-Lab/AirRadar.
中文: AirRadar是一种深度神经网络,通过利用现有监测站数据,采用捕捉空间相关性和调整分布偏移的两阶段方法,精准推断未监测区域的实时空气质量,在中国大规模数据集上验证了其优越性能。
English: AirRadar is a deep neural network that accurately infers real-time air quality in unmonitored areas by leveraging data from existing stations through a two-stage process of capturing spatial correlations and adjusting for distribution shifts, validated as superior on a large Chinese dataset.

Authors:Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
Title: VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Abstract:
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale and high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) Vision Encoder Adaptation, which enables the vision encoder to accept images of variable resolutions as input; 2) Vision-Language Alignment, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, charts) as well as text-only data; 3) Multi-task Fine-tuning, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; and 4) Video-centric Fine-tuning, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into a correspondingly varying number of vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the number of vision tokens according to their similarity so that the representation of videos will be more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
中文摘要:VideoLLaMA3提出了一种以视觉为中心的多模态基础模型,通过高质量图文数据和自适应视觉标记处理,在图像与视频理解任务中均表现出色。
English Summary: VideoLLaMA3 introduces a vision-centric multimodal foundation model that prioritizes high-quality image-text data and adaptive visual token processing to excel in both image and video understanding tasks.

Authors:Jiachen Lei, Julius Berner, Jiongxiao Wang, Zhongzhu Chen, Zhongjia Ba, Kui Ren, Jun Zhu, Anima Anandkumar
Title: Robust Representation Consistency Model via Contrastive Denoising
Abstract:
Robustness is essential for deep neural networks, especially in security-sensitive applications. To this end, randomized smoothing provides theoretical guarantees for certifying robustness against adversarial perturbations. Recently, diffusion models have been successfully employed for randomized smoothing to purify noise-perturbed samples before making predictions with a standard classifier. While these methods excel at small perturbation radii, they struggle with larger perturbations and incur a significant computational overhead during inference compared to classical methods. To address this, we reformulate the generative modeling task along the diffusion trajectories in pixel space as a discriminative task in the latent space. Specifically, we use instance discrimination to achieve consistent representations along the trajectories by aligning temporally adjacent points. After fine-tuning based on the learned representations, our model enables implicit denoising-then-classification via a single prediction, substantially reducing inference costs. We conduct extensive experiments on various datasets and achieve state-of-the-art performance with minimal computation budget during inference. For example, our method outperforms the certified accuracy of diffusion-based methods on ImageNet across all perturbation radii by 5.3% on average, with up to 11.6% at larger radii, while reducing inference costs by 85$\times$ on average. Codes are available at: https://github.com/jiachenlei/rRCM.
中文: 本研究提出了一种新方法,将生成式扩散模型重构为潜在空间中的判别任务,显著提升了对抗扰动的鲁棒性,同时将推理成本降低85倍,并在多个数据集上实现了最先进的认证精度。
English: This study introduces a novel method that reformulates generative diffusion modeling as a discriminative task in latent space, significantly enhancing robustness against adversarial perturbations while reducing inference costs by 85 times and achieving state-of-the-art certified accuracy across various datasets.
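
For context, the basic randomized smoothing prediction that the abstract builds on can be sketched as a majority vote over Gaussian-perturbed copies of the input. This is a generic illustration, not the paper's method: `classify` is a hypothetical base classifier, and the radius uses the empirical top-class frequency instead of a statistical lower bound, so it is not a valid certificate.

```python
import random
from collections import Counter
from statistics import NormalDist
from typing import Callable, List, Tuple


def smoothed_predict(
    classify: Callable[[List[float]], int],
    x: List[float],
    sigma: float,
    n_samples: int,
    rng: random.Random,
) -> Tuple[int, float]:
    """Majority-vote label under Gaussian noise and a rough L2 radius estimate."""
    votes: Counter = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    label, count = votes.most_common(1)[0]
    # Clamp below 1 so the inverse normal CDF stays finite.
    p_top = min(count / n_samples, 1.0 - 1.0 / (2 * n_samples))
    # Binary-bound form: radius = sigma * Phi^{-1}(p_top) when p_top > 1/2.
    radius = sigma * NormalDist().inv_cdf(p_top) if p_top > 0.5 else 0.0
    return label, radius
```

Each prediction here costs `n_samples` forward passes of the base classifier, which is the inference overhead that becomes significant when a diffusion denoiser is placed in front of the classifier.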

Authors:Bohao Yang, Yingji Zhang, Dong Liu, André Freitas, Chenghua Lin
Title: Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning
Abstract:
Recent large language models (LLMs) have advanced table understanding capabilities but rely on converting tables into text sequences. While multimodal large language models (MLLMs) enable direct visual processing, they face limitations in handling scientific tables due to fixed input image resolutions and insufficient numerical reasoning capabilities. We present a comprehensive framework for multimodal scientific table understanding and reasoning with dynamic input image resolutions. Our framework consists of three key components: (1) MMSci-Pre, a domain-specific table structure learning dataset of 52K scientific table structure recognition samples, (2) MMSci-Ins, an instruction tuning dataset with 12K samples across three table-based tasks, and (3) MMSci-Eval, a benchmark with 3,114 testing samples specifically designed to evaluate numerical reasoning capabilities. Extensive experiments demonstrate that our domain-specific approach with 52K scientific table images achieves superior performance compared to 150K general-domain tables, highlighting the importance of data quality over quantity. Our proposed table-based MLLMs with dynamic input resolutions show significant improvements in both general table understanding and numerical reasoning capabilities, with strong generalisation to held-out datasets. Our code and data are publicly available at https://github.com/Bernard-Yang/MMSci_Table.
中文: 本研究提出了一种动态分辨率的多模态框架,通过领域专用数据集提升科学表格理解能力,实验证明以质量为导向的训练方法优于传统大规模数据训练。
English: The study introduces a multimodal framework with dynamic image resolution to enhance scientific table understanding, featuring domain-specific datasets and demonstrating superior performance through quality-focused training.

Authors:Yifan Hu, Guibin Zhang, Peiyuan Liu, Disen Lan, Naiqi Li, Dawei Cheng, Tao Dai, Shu-Tao Xia, Shirui Pan
Title: TimeFilter: Patch-Specific Spatial-Temporal Graph Filtration for Time Series Forecasting
Abstract:
Time series forecasting methods generally fall into two main categories: Channel Independent (CI) and Channel Dependent (CD) strategies. While CI overlooks important covariate relationships, CD captures all dependencies without distinction, introducing noise and reducing generalization. Recent advances in Channel Clustering (CC) aim to refine dependency modeling by grouping channels with similar characteristics and applying tailored modeling techniques. However, coarse-grained clustering struggles to capture complex, time-varying interactions effectively. To address these challenges, we propose TimeFilter, a GNN-based framework for adaptive and fine-grained dependency modeling. After constructing the graph from the input sequence, TimeFilter refines the learned spatial-temporal dependencies by filtering out irrelevant correlations while preserving the most critical ones in a patch-specific manner. Extensive experiments on 13 real-world datasets from diverse application domains demonstrate the state-of-the-art performance of TimeFilter. The code is available at https://github.com/TROUBADOUR000/TimeFilter.
中文: TimeFilter是一种基于图神经网络的框架,通过自适应地过滤无关的通道间依赖关系并保留关键关联,提升了时间序列预测性能,在多个现实数据集中实现了领先水平。
English: TimeFilter is a GNN-based framework that enhances time series forecasting by adaptively filtering out irrelevant inter-channel dependencies while preserving critical ones, achieving state-of-the-art results across diverse datasets.

Authors:Hong Wang, Yinglong Zhang, Zhangqi Zhao, Zhicong Cai, Xuewen Xia, Xing Xu
Title: Less is More: Simple yet Effective Heuristic Community Detection with Graph Convolution Network
Abstract:
Community detection is crucial in data mining. Traditional methods primarily focus on graph structure, often neglecting the significance of attribute features. In contrast, deep learning-based approaches incorporate attribute features and local structural information through contrastive learning, improving detection performance. However, existing algorithms' complex design and joint optimization make them difficult to train and reduce detection efficiency. Additionally, these methods require the number of communities to be predefined, making the results susceptible to artificial interference. To address these challenges, we propose a simple yet effective community detection algorithm that can adaptively detect communities without relying on data augmentation and contrastive optimization. The proposed algorithm first performs community pre-detection to extract global structural information adaptively. It then utilizes GCN to integrate local structures and attribute features. Subsequently, it combines global structure, local structure, and attribute features in the feature space to discover community affiliations. Finally, a modularity maximization method is employed to optimize the communities based on these three types of information, thereby uncovering the community affiliation of each node. We conduct experimental comparisons across various graph datasets, evaluating the proposed algorithm against traditional methods and state-of-the-art community detection algorithms. The experimental results demonstrate that our algorithm is both faster and more accurate than the compared methods. The code is available at https://github.com/wuanghoong/Less-is-More.git.
中文: 本研究提出了一种简单有效的社区检测算法,通过GCN和模块化优化自适应整合全局与局部结构信息及属性特征,无需预定义社区数量或复杂对比学习,即可实现更高的检测效率和精度。
English: This study introduces a simple yet effective community detection algorithm that adaptively integrates global and local structural information with attribute features using GCN and modularity optimization, achieving superior efficiency and accuracy without requiring predefined community numbers or complex contrastive learning.
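
The abstract above names modularity maximization as the final optimization step. As a minimal sketch of the quantity being maximized, the snippet below computes the standard Newman modularity Q for a hard partition of a simple undirected graph. The function name, the toy graph, and the edge-list input format are illustrative assumptions and are not taken from the authors' repository.

```python
def modularity(edges, labels):
    """Newman modularity Q of a partition.

    edges  : list of (u, v) pairs of a simple undirected graph (no self-loops)
    labels : mapping node -> community id (a list indexed by node works too)
    """
    m = len(edges)
    degree = {}
    intra_edges = {}  # number of edges with both endpoints in community c
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if labels[u] == labels[v]:
            intra_edges[labels[u]] = intra_edges.get(labels[u], 0) + 1

    degree_sum = {}  # total degree of the nodes in community c
    for node, d in degree.items():
        c = labels[node]
        degree_sum[c] = degree_sum.get(c, 0) + d

    # Q = sum_c [ L_c / m - (d_c / 2m)^2 ]
    return sum(
        intra_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_split = modularity(edges, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(edges, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the toy graph along the bridge scores higher than the trivial single-community partition, which is the kind of preference a modularity-maximizing step relies on.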

Authors:Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
Title: Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Abstract:
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLM to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
Chinese: 本文提出的测试时偏好优化(TPO)框架,通过在推理过程中将奖励信号转化为文本评价来对齐大语言模型与人类偏好,无需重新训练模型。
English: This paper introduces Test-time Preference Optimization (TPO), a framework that aligns large language models with human preferences during inference by converting reward signals into textual critiques, eliminating the need for model retraining.

Authors:Ruicheng Zhang, Haowei Guo, Zeyu Zhang, Puxin Yan, Shen Zhao
Title: GAMED-Snake: Gradient-aware Adaptive Momentum Evolution Deep Snake Model for Multi-organ Segmentation
Abstract:
Multi-organ segmentation is a critical yet challenging task due to complex anatomical backgrounds, blurred boundaries, and diverse morphologies. This study introduces the Gradient-aware Adaptive Momentum Evolution Deep Snake (GAMED-Snake) model, which establishes a novel paradigm for contour-based segmentation by integrating gradient-based learning with adaptive momentum evolution mechanisms. The GAMED-Snake model incorporates three major innovations: First, the Distance Energy Map Prior (DEMP) generates a pixel-level force field that effectively attracts contour points towards the true boundaries, even in scenarios with complex backgrounds and blurred edges. Second, the Differential Convolution Inception Module (DCIM) precisely extracts comprehensive energy gradients, significantly enhancing segmentation accuracy. Third, the Adaptive Momentum Evolution Mechanism (AMEM) employs cross-attention to establish dynamic features across different iterations of evolution, enabling precise boundary alignment for diverse morphologies. Experimental results on four challenging multi-organ segmentation datasets demonstrate that GAMED-Snake improves the mDice metric by approximately 2% compared to state-of-the-art methods. Code will be available at https://github.com/SYSUzrc/GAMED-Snake.
Chinese: 本研究提出的GAMED-Snake模型通过融合梯度学习和自适应动量演化机制,改进了多器官分割中的边界对齐和精度,在四个数据集上的实验表明其mDice指标比现有最优方法提升约2%。
English: This study introduces the GAMED-Snake model, which integrates gradient-based learning with adaptive momentum evolution to enhance multi-organ segmentation by improving boundary alignment and accuracy, achieving an mDice improvement of approximately 2% over state-of-the-art methods on four datasets.

Authors:Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, Alexander Panchenko
Title: Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
Abstract:
Retrieval Augmented Generation (RAG) improves correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs' intrinsic knowledge with external information appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
中文摘要:对RAG系统中自适应检索方法的全面评估表明,不确定性估计技术在效率和自我认知方面常优于复杂流程,同时保持相当的问答性能。
English Summary: Adaptive retrieval methods in RAG systems are comprehensively evaluated, revealing that uncertainty estimation techniques often surpass complex pipelines in efficiency and self-knowledge while maintaining similar QA performance.

Authors:Ruicheng Zhang, Kanghui Tian, Zeyu Zhang, Qixiang Liu, Zhi Jin
Title: FDG-Diff: Frequency-Domain-Guided Diffusion Framework for Compressed Hazy Image Restoration
Abstract:
In this study, we reveal that the interaction between haze degradation and JPEG compression introduces complex joint loss effects, which significantly complicate image restoration. Existing dehazing models often neglect compression effects, which limits their effectiveness in practical applications. To address these challenges, we introduce three key contributions. First, we design FDG-Diff, a novel frequency-domain-guided dehazing framework that improves JPEG image restoration by leveraging frequency-domain information. Second, we introduce the High-Frequency Compensation Module (HFCM), which enhances spatial-domain detail restoration by incorporating frequency-domain augmentation techniques into a diffusion-based restoration framework. Lastly, the introduction of the Degradation-Aware Denoising Timestep Predictor (DADTP) module further enhances restoration quality by enabling adaptive region-specific restoration, effectively addressing regional degradation inconsistencies in compressed hazy images. Experimental results across multiple compressed dehazing datasets demonstrate that our method consistently outperforms the latest state-of-the-art approaches. Code will be available at https://github.com/SYSUzrc/FDG-Diff.
中文摘要:本研究提出FDG-Diff框架,通过频域引导的去雾方法和创新模块有效解决雾霾与JPEG压缩的共同退化问题,在图像恢复效果上超越现有先进方法。
English Summary: This study introduces FDG-Diff, a frequency-domain-guided dehazing framework that effectively addresses the joint degradation effects of haze and JPEG compression through innovative modules for enhanced image restoration, outperforming existing methods.

Authors:Xiaolei Chen, Junchi Yan, Wenlong Liao, Tao He, Pai Peng
Title: Int2Planner: An Intention-based Multi-modal Motion Planner for Integrated Prediction and Planning
Abstract:
Motion planning is a critical module in autonomous driving, with the primary challenge of uncertainty caused by interactions with other participants. As most previous methods treat prediction and planning as separate tasks, it is difficult to model these interactions. Furthermore, since the route path navigates ego vehicles to a predefined destination, it provides relatively stable intentions for ego vehicles and helps constrain uncertainty. On this basis, we construct Int2Planner, an \textbf{Int}ention-based \textbf{Int}egrated motion \textbf{Planner} that achieves multi-modal planning and prediction. Instead of static intention points, Int2Planner utilizes route intention points for ego vehicles and generates corresponding planning trajectories for each intention point to facilitate multi-modal planning. The experiments on the private dataset and the public nuPlan benchmark show the effectiveness of route intention points, and Int2Planner achieves state-of-the-art performance. We also deploy it in real-world vehicles and have conducted autonomous driving for hundreds of kilometers in urban areas. It further verifies that Int2Planner can continuously interact with the traffic environment. Code will be available at https://github.com/cxlz/Int2Planner.
中文摘要:Int2Planner是一种基于路径意图点的集成运动规划器,通过多模态规划与预测应对自动驾驶中的交互不确定性,在测试和实际部署中均展现出领先性能。
English Summary: Int2Planner is an integrated motion planner that uses route intention points to address uncertainty in autonomous driving by enabling multi-modal planning and prediction, achieving state-of-the-art performance in tests and real-world deployment.

Authors:Maxime Maria, Simon Guionnière, Nicolas Dacquay, Cyprien Plateau-Holleville, Valentin Guillaume, Vincent Larroque, Jean Lardé, Yassine Naimi, Jean-Philip Piquemal, Guillaume Levieux, Nathalie Lagarde, Stéphane Mérillou, Matthieu Montes
Title: VTX: Real-time high-performance molecular structure and dynamics visualization software
Abstract:
Summary: VTX is a molecular visualization software capable of handling most molecular structure and dynamics trajectory file formats. It features a real-time high-performance molecular graphics engine, based on modern OpenGL, optimized for the visualization of massive molecular systems and molecular dynamics trajectories. VTX includes multiple interactive camera and user interaction features, notably free-fly navigation and a fully modular graphical user interface designed for increased usability. It allows the production of high-resolution images for presentations and posters with custom background. VTX design is focused on performance and usability for research, teaching and educational purposes. Availability and implementation: VTX is open source and free for non-commercial use. Builds for Windows and Ubuntu Linux are available at http://vtx.drugdesign.fr. The source code is available at https://github.com/VTX-Molecular-Visualization . Supplementary Information: A video displaying free-fly navigation in a whole-cell model is available
中文: VTX是一款开源分子可视化软件,具备高性能图形引擎,可处理大型分子系统和动态轨迹,并配有交互式导航和模块化界面,专为科研与教学而设计。
English: VTX is an open-source molecular visualization software featuring a high-performance graphics engine for handling large molecular systems and dynamics, with interactive navigation and a modular interface designed for research and education.

Authors:Jesus Renero, Idoia Ochoa, Roberto Maestre
Title: REX: Causal Discovery based on Machine Learning and Explainability techniques
Abstract:
Explainability techniques hold significant potential for enhancing the causal discovery process, which is crucial for understanding complex systems in areas like healthcare, economics, and artificial intelligence. However, no causal discovery methods currently incorporate explainability into their models to derive causal graphs. Thus, in this paper we explore this innovative approach, as it offers substantial potential and represents a promising new direction worth investigating. Specifically, we introduce REX, a causal discovery method that leverages machine learning (ML) models coupled with explainability techniques, specifically Shapley values, to identify and interpret significant causal relationships among variables. Comparative evaluations on synthetic datasets comprising continuous tabular data reveal that REX outperforms state-of-the-art causal discovery methods across diverse data generation processes, including non-linear and additive noise models. Moreover, REX was tested on the Sachs single-cell protein-signaling dataset, achieving a precision of 0.952 and recovering key causal relationships with no incorrect edges. Taken together, these results showcase REX's effectiveness in accurately recovering true causal structures while minimizing false positive predictions, its robustness across diverse datasets, and its applicability to real-world problems. By combining ML and explainability techniques with causal discovery, REX bridges the gap between predictive modeling and causal inference, offering an effective tool for understanding complex causal structures. REX is publicly available at https://github.com/renero/causalgraph.
Chinese: 本文提出REX这一新型因果发现方法,通过将机器学习与可解释性技术(如Shapley值)相结合来准确识别因果关系,在合成和真实数据集上均展现出优于现有方法的性能表现。
English: This paper introduces REX, a novel causal discovery method that integrates machine learning with explainability techniques like Shapley values to accurately identify causal relationships, demonstrating superior performance over existing methods in both synthetic and real-world datasets.

Authors:Haocheng Luo, Tuan Truong, Tung Pham, Mehrtash Harandi, Dinh Phung, Trung Le
Title: Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization
Abstract:
Sharpness-Aware Minimization (SAM) has attracted significant attention for its effectiveness in improving generalization across various tasks. However, its underlying principles remain poorly understood. In this work, we analyze SAM's training dynamics using the maximum eigenvalue of the Hessian as a measure of sharpness, and propose a third-order stochastic differential equation (SDE), which reveals that the dynamics are driven by a complex mixture of second- and third-order terms. We show that alignment between the perturbation vector and the top eigenvector is crucial for SAM's effectiveness in regularizing sharpness, but find that this alignment is often inadequate in practice, limiting SAM's efficiency. Building on these insights, we introduce Eigen-SAM, an algorithm that explicitly aims to regularize the top Hessian eigenvalue by aligning the perturbation vector with the leading eigenvector. We validate the effectiveness of our theory and the practical advantages of our proposed approach through comprehensive experiments. Code is available at https://github.com/RitianLuo/EigenSAM.
中文: 本研究分析了锐度感知最小化(SAM)方法,发现其有效性依赖于扰动向量与海森矩阵主特征向量的对齐,据此提出了Eigen-SAM算法,通过显式优化这种对齐来提升正则化效果和计算效率。
English: This study analyzes Sharpness-Aware Minimization (SAM) and reveals that its effectiveness depends on aligning the perturbation vector with the Hessian's top eigenvector, leading to the development of Eigen-SAM, which explicitly optimizes this alignment to improve regularization and efficiency.
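
To make the alignment argument above concrete, here is a minimal sketch of one vanilla SAM step on a toy diagonal quadratic, together with the cosine alignment between the SAM perturbation and the top Hessian eigenvector. This is not Eigen-SAM itself. The toy loss, the hyperparameters, and all function names are assumptions chosen for illustration only.

```python
import math


def grad(w, curvatures):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i^2); its Hessian is diag(curvatures)."""
    return [c * x for c, x in zip(curvatures, w)]


def loss(w, curvatures):
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, w))


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def sam_step(w, curvatures, lr=0.1, rho=0.05):
    """One SAM update: ascend by rho along the normalized gradient, then
    descend from the original point using the gradient at the perturbed point."""
    g = grad(w, curvatures)
    g_norm = norm(g)
    if g_norm == 0.0:
        return list(w), [0.0] * len(w)
    eps = [rho * x / g_norm for x in g]  # SAM perturbation
    g_adv = grad([x + e for x, e in zip(w, eps)], curvatures)
    new_w = [x - lr * ga for x, ga in zip(w, g_adv)]
    return new_w, eps


def alignment(eps, direction):
    """Absolute cosine similarity between the perturbation and a direction."""
    dot = sum(e * d for e, d in zip(eps, direction))
    return abs(dot) / (norm(eps) * norm(direction))


curvatures = [10.0, 1.0]   # top Hessian eigenvector is the first axis
top_eigvec = [1.0, 0.0]
w0 = [1.0, 1.0]
w1, eps = sam_step(w0, curvatures)
align = alignment(eps, top_eigvec)
```

On this toy problem the gradient is dominated by the high-curvature axis, so the perturbation is nearly aligned with the top eigenvector. The abstract's point is that such alignment is often inadequate in practice, which is the gap Eigen-SAM is described as targeting.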

Authors:Qiong Wu, Maoxin Ji, Pingyi Fan, Kezhi Wang, Nan Cheng, Wen Chen, Khaled B. Letaief
Title: PPO-Based Vehicle Control for Ramp Merging Scheme Assisted by Enhanced C-V2X
Abstract:
On-ramp merging presents a critical challenge in autonomous driving, as vehicles from merging lanes need to dynamically adjust their positions and speeds while monitoring traffic on the main road to prevent collisions. To address this challenge, we propose a novel merging control scheme based on reinforcement learning, which integrates lateral control mechanisms. This approach ensures the smooth integration of vehicles from the merging lane onto the main road, optimizing both fuel efficiency and passenger comfort. Furthermore, we recognize the impact of vehicle-to-vehicle (V2V) communication on control strategies and introduce an enhanced protocol leveraging Cellular Vehicle-to-Everything (C-V2X) Mode 4. This protocol aims to reduce the Age of Information (AoI) and improve communication reliability. In our simulations, we employ two AoI-based metrics to rigorously assess the protocol's effectiveness in autonomous driving scenarios. By combining the NS3 network simulator with Python, we simulate V2V communication and vehicle control simultaneously. The results demonstrate that the enhanced C-V2X Mode 4 outperforms the standard version, while the proposed control scheme ensures safe and reliable vehicle operation during on-ramp merging.
中文: 基于强化学习的匝道汇合控制方案与增强型C-V2X Mode 4协议,通过优化车辆协调和通信可靠性,显著提升了自动驾驶匝道汇入的安全性与效率。
English: The proposed reinforcement learning-based merging control scheme with enhanced C-V2X Mode 4 protocol significantly improves safety and efficiency in autonomous on-ramp merging by optimizing vehicle coordination and communication reliability.

Authors:Dunwei Tu, Huiyu Yi, Yuchi Wang, Baile Xu, Jian Zhao, Furao Shen
Title: Multiple Queries with Multiple Keys: A Precise Prompt Matching Paradigm for Prompt-based Continual Learning
Abstract:
Continual learning requires machine learning models to continuously acquire new knowledge in dynamic environments while avoiding the forgetting of previous knowledge. Prompt-based continual learning methods effectively address the issue of catastrophic forgetting through prompt expansion and selection. However, existing approaches often suffer from low accuracy in prompt selection, which can result in the model receiving biased knowledge and making biased predictions. To address this issue, we propose the Multiple Queries with Multiple Keys (MQMK) prompt matching paradigm for precise prompt selection. The goal of MQMK is to select the prompts whose training data distribution most closely matches that of the test sample. Specifically, Multiple Queries enable precise breadth search by introducing task-specific knowledge, while Multiple Keys perform deep search by representing the feature distribution of training samples at a fine-grained level. Each query is designed to perform local matching with a designated task to reduce interference across queries. Experiments show that MQMK enhances the prompt matching rate by over 30\% in challenging scenarios and achieves state-of-the-art performance on three widely adopted continual learning benchmarks. The code is available at https://github.com/DunweiTu/MQMK.
Chinese: 提出的多查询多密钥(MQMK)范式通过实现精确的广度和深度搜索,显著提升了持续学习中的提示选择准确性,在基准测试中实现了超过30%的匹配率提升和最先进的性能。
English: The proposed Multiple Queries with Multiple Keys (MQMK) paradigm significantly improves prompt selection accuracy in continual learning by enabling precise breadth and depth searches, achieving over 30% higher matching rates and state-of-the-art performance on benchmarks.

Authors:Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng
Title: Adaptive Data Exploitation in Deep Reinforcement Learning
Abstract:
We introduce ADEPT: Adaptive Data ExPloiTation, a simple yet powerful framework to enhance the **data efficiency** and **generalization** in deep reinforcement learning (RL). Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms, optimizing data utilization while mitigating overfitting. Moreover, ADEPT can significantly reduce the computational overhead and accelerate a wide range of RL algorithms. We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet. Extensive simulation demonstrates that ADEPT can achieve superior performance with remarkable computational efficiency, offering a practical solution to data-efficient RL. Our code is available at https://github.com/yuanmingqi/ADEPT.
Chinese: ADEPT是一种创新的强化学习框架,通过多臂老虎机算法自适应管理数据使用,显著提升数据效率和泛化能力,在降低计算开销的同时,于多个基准测试中实现了卓越性能。
English: ADEPT is a novel framework that improves data efficiency and generalization in deep reinforcement learning by adaptively managing data usage with multi-armed bandit algorithms, reducing computational costs while enhancing performance across various benchmarks.
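
The abstract states that ADEPT schedules data usage with multi-armed bandit algorithms but does not say which one. As a hedged illustration of the general idea, the sketch below uses UCB1 to choose among a few hypothetical "data reuse" settings. The arm meanings, the deterministic rewards, and the class name are assumptions and do not come from the ADEPT code base.

```python
import math


class UCB1:
    """Standard UCB1 bandit over a fixed number of arms."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.c = c

    def select(self):
        # Pull every arm once before using confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            v + math.sqrt(self.c * math.log(total) / n)
            for v, n in zip(self.values, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n


# Hypothetical arms: e.g. reuse each sampled batch for 1, 2, or 3 update epochs.
# Rewards are fixed stand-ins for a measured learning signal.
arm_rewards = [0.2, 0.8, 0.5]
bandit = UCB1(n_arms=len(arm_rewards))
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, arm_rewards[arm])

most_pulled = bandit.counts.index(max(bandit.counts))
```

In a training loop the reward would come from a measured signal such as the change in return or value loss after an update, so the bandit gradually favors the reuse setting that helps most at the current learning stage.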

Authors:Sunbowen Lee, Junting Zhou, Chang Ao, Kaige Li, Xinrun Du, Sirui He, Haihong Wu, Tianci Liu, Jiaheng Liu, Hamid Alinejad-Rokny, Min Yang, Yitao Liang, Zhoufutu Wen, Shiwen Ni
Title: Quantification of Large Language Model Distillation
Abstract:
Model distillation is a fundamental technique in building large language models (LLMs), transferring knowledge from a teacher model to a student model. However, distillation can lead to model homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs' robustness and safety. The code and data are available under https://github.com/Aegis1863/LLMs-Distillation-Quantification.
中文: 本研究提出一个量化大语言模型蒸馏过程的框架,通过分析身份认知矛盾和多粒度响应相似性,发现除Claude、豆包和Gemini外多数模型呈现高蒸馏度,并呼吁加强独立开发和透明度以提升模型的鲁棒性与安全性。
English: This study introduces a framework to quantify model distillation in large language models, focusing on identity cognition contradictions and multi-granularity response similarities, revealing high distillation degrees in most models except Claude, Doubao, and Gemini, and advocating for more independent development and transparent reporting to enhance model robustness and safety.

Authors:Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao
Title: T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
Abstract:
Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and privacy. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under https://github.com/adwardlee/t2i_safety.
中文: T2ISafety作为一个全面的安全基准,通过评估文本到图像模型在毒性、公平性和隐私方面的表现,涵盖12项任务和44个类别,利用7万条提示和6.8万张标注图像,揭示了12种扩散模型中存在的种族公平性问题、有害内容生成以及隐私保护差异等风险。
English: T2ISafety is a comprehensive benchmark addressing safety gaps in text-to-image models by evaluating toxicity, fairness, and privacy across 12 tasks and 44 categories, using 70K prompts and 68K annotated images to reveal risks such as racial unfairness, toxic content generation, and uneven privacy protection in 12 diffusion models.

Authors:Wei Tang, Yin-Fang Yang, Zhaofei Wang, Weijia Zhang, Min-Ling Zhang
Title: Multi-Instance Partial-Label Learning with Margin Adjustment
Abstract:
Multi-instance partial-label learning (MIPL) is an emerging learning framework where each training sample is represented as a multi-instance bag associated with a candidate label set. Existing MIPL algorithms often overlook the margins for attention scores and predicted probabilities, leading to suboptimal generalization performance. A critical issue with these algorithms is that the highest prediction probability of the classifier may appear on a non-candidate label. In this paper, we propose an algorithm named MIPLMA, i.e., Multi-Instance Partial-Label learning with Margin Adjustment, which adjusts the margins for attention scores and predicted probabilities. We introduce a margin-aware attention mechanism to dynamically adjust the margins for attention scores and propose a margin distribution loss to constrain the margins between the predicted probabilities on candidate and non-candidate label sets. Experimental results demonstrate the superior performance of MIPLMA over existing MIPL algorithms, as well as other well-established multi-instance learning algorithms and partial-label learning algorithms.
中文: 提出的MIPLMA算法通过动态调整注意力分数和预测概率的边界,在多示例部分标记学习中取得优越性能,实验证明其超越现有方法。
English: The proposed MIPLMA algorithm improves multi-instance partial-label learning by dynamically adjusting margins for attention scores and predicted probabilities, outperforming existing methods in experimental evaluations.

Authors:Yongduo Sui, Jie Sun, Shuyao Wang, Zemin Liu, Qing Cui, Longfei Li, Xiang Wang
Title: A Unified Invariant Learning Framework for Graph Classification
Abstract:
Invariant learning demonstrates substantial potential for enhancing the generalization of graph neural networks (GNNs) with out-of-distribution (OOD) data. It aims to recognize stable features in graph data for classification, based on the premise that these features causally determine the target label, and their influence is invariant to changes in distribution. Along this line, most studies have attempted to pinpoint these stable features by emphasizing explicit substructures in the graph, such as masked or attentive subgraphs, and primarily enforcing the invariance principle in the semantic space, i.e., graph representations. However, we argue that focusing only on the semantic space may not accurately identify these stable features. To address this, we introduce the Unified Invariant Learning (UIL) framework for graph classification. It provides a unified perspective on invariant graph learning, emphasizing both structural and semantic invariance principles to identify more robust stable features. In the graph space, UIL adheres to the structural invariance principle by reducing the distance between graphons over a set of stable features across different environments. Simultaneously, to confirm semantic invariance, UIL underscores that the acquired graph representations should demonstrate exemplary performance across diverse environments. We present both theoretical and empirical evidence to confirm our method's ability to recognize superior stable features. Moreover, through a series of comprehensive experiments complemented by in-depth analyses, we demonstrate that UIL considerably enhances OOD generalization, surpassing the performance of leading baseline methods. Our codes are available at https://github.com/yongduosui/UIL.
Chinese: 统一不变学习(UIL)框架通过同时强化结构和语义不变性原则,提升了图神经网络在分布外数据上的泛化能力,在多项实验中超越了现有基线方法的性能表现。
English: The Unified Invariant Learning (UIL) framework enhances graph neural network generalization by jointly enforcing structural and semantic invariance principles, outperforming existing methods in out-of-distribution scenarios through robust feature identification.

Authors:Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao
Title: O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
Abstract:
Recently, long-thought reasoning LLMs, such as OpenAI's O1, adopt extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner
中文:近期长思维推理大模型虽提升解题能力却显著增加推理耗时,为此提出的O1-Pruner微调方法通过优化推理长度,在保证精度的同时大幅降低计算开销。
English: Recent long-thought reasoning LLMs enhance problem-solving but increase inference time, prompting the development of O1-Pruner, a fine-tuning method that reduces reasoning redundancy while maintaining or improving accuracy.

Authors:Kevin Ta, Patrick Foley, Mattson Thieme, Abhishek Pandey, Prashant Shah
Title: Federated Discrete Denoising Diffusion Model for Molecular Generation with OpenFL
Abstract:
Generating unique molecules with biochemically desired properties to serve as viable drug candidates is a difficult task that requires specialized domain expertise. In recent years, diffusion models have shown promising results in accelerating the drug design process through AI-driven molecular generation. However, training these models requires massive amounts of data, which are often isolated in proprietary silos. OpenFL is a federated learning framework that enables privacy-preserving collaborative training across these decentralized data sites. In this work, we present a federated discrete denoising diffusion model that was trained using OpenFL. The federated model achieves comparable performance with a model trained on centralized data when evaluating the uniqueness and validity of the generated molecules. This demonstrates the utility of federated learning in the drug design process. OpenFL is available at: https://github.com/securefederatedai/openfl
中文:利用OpenFL的联邦学习实现了隐私保护的扩散模型协同训练,在生成独特且有效的药物候选分子方面达到了与集中式训练相当的性能。
English: Federated learning with OpenFL enables privacy-preserving collaborative training of diffusion models for molecular generation, achieving comparable performance to centralized training in producing unique and valid drug candidates.

Authors:Xiaoyu Chu, Sacheendra Talluri, Qingxian Lu, Alexandru Iosup
Title: An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models
Abstract:
People and businesses increasingly rely on public LLM services, such as ChatGPT, DALLE, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. Addressing this problem, in this work we conduct an empirical characterization of outages and failure-recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of failure recovery statistical properties, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) Failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure-isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available on https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
中文摘要:本研究对八大主流公共大语言模型服务的中断与故障恢复过程进行实证分析,揭示了OpenAI与Anthropic等供应商在故障频率、解决时间和隔离能力方面的关键差异。
English Summary: This study empirically analyzes outages and failure-recovery processes in eight major public LLM services, revealing key differences in failure frequency, resolution time, and isolation capabilities between providers like OpenAI and Anthropic.

Authors:Shanmin Wang, Chengguang Liu, Qingshan Liu
Title: Multi-Modality Collaborative Learning for Sentiment Analysis
Abstract:
Multimodal sentiment analysis (MSA) identifies individuals' sentiment states in videos by integrating visual, audio, and text modalities. Despite progress in existing methods, the inherent modality heterogeneity limits the effective capture of interactive sentiment features across modalities. In this paper, by introducing a Multi-Modality Collaborative Learning (MMCL) framework, we facilitate cross-modal interactions and capture enhanced and complementary features from modality-common and modality-specific representations, respectively. Specifically, we design a parameter-free decoupling module and separate uni-modality into modality-common and modality-specific components through semantics assessment of cross-modal elements. For modality-specific representations, inspired by the act-reward mechanism in reinforcement learning, we design policy models to adaptively mine complementary sentiment features under the guidance of a joint reward. For modality-common representations, intra-modal attention is employed to highlight crucial components, playing enhanced roles among modalities. Experimental results, including superiority evaluations on four databases, effectiveness verification of each module, and assessment of complementary features, demonstrate that MMCL successfully learns collaborative features across modalities and significantly improves performance. The code is available at https://github.com/smwanghhh/MMCL.
中文: 本文提出了一种多模态协同学习框架,通过分别从模态共有和模态特定表示中自适应地挖掘互补与增强特征,有效提升了多模态情感分析的性能,并在多个数据库上验证了其优越性。
English: This paper introduces a Multi-Modality Collaborative Learning framework that enhances multimodal sentiment analysis by adaptively capturing complementary and enhanced features from modality-specific and modality-common representations, achieving superior performance across multiple databases.

Authors:Yonghao Zhao, Changtao Li, Chi Shu, Qingbin Wu, Hong Li, Chuan Xu, Tianrui Li, Ziqiang Wang, Zhipeng Luo, Yazhou He
Title: Tackling Small Sample Survival Analysis via Transfer Learning: A Study of Colorectal Cancer Prognosis
Abstract:
Survival prognosis is crucial for medical informatics. Practitioners often confront small-sized clinical data, especially cancer patient cases, which can be insufficient to induce useful patterns for survival predictions. This study deals with small sample survival analysis by leveraging transfer learning, a useful machine learning technique that can enhance the target analysis with related knowledge pre-learned from other data. We propose and develop various transfer learning methods designed for common survival models. For parametric models such as DeepSurv, Cox-CC (Cox-based neural networks), and DeepHit (end-to-end deep learning model), we apply standard transfer learning techniques like pretraining and fine-tuning. For non-parametric models such as Random Survival Forest, we propose a new transfer survival forest (TSF) model that transfers tree structures from source tasks and fine-tunes them with target data. We evaluated the transfer learning methods on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. When enhanced by transfer learning, Cox-CC's $C^{td}$ value was boosted from 0.7868 to 0.8111, DeepHit's from 0.8085 to 0.8135, DeepSurv's from 0.7722 to 0.8043, and RSF's from 0.7940 to 0.8297 (the highest performance). All models trained with as few as 50 samples demonstrated even more significant improvement. We conclude that the current survival models used for cancer prognosis can be improved by properly designed transfer learning techniques. The source code used in this study is available at https://github.com/YonghaoZhao722/TSF.
中文: 本研究通过将迁移学习技术应用于参数化和非参数化生存模型,显著提升了小样本临床数据的生存预测性能,并在结直肠癌预后中验证了其有效性。
English: This study enhances survival prognosis for small clinical datasets by applying transfer learning techniques to both parametric and non-parametric survival models, demonstrating significant performance improvements in colorectal cancer prediction.

Authors:Jingwei Yi, Junhao Yin, Ju Xu, Peng Bao, Yongliang Wang, Wei Fan, Hao Wang
Title: ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.
中文: 本文提出了情境图像引用能力,使视觉语言模型能够基于对话上下文恰当引用检索文档中的相关图像,并通过指令微调方法ImageRef-VL显著提升了开源模型在此任务上的性能表现。
English: This paper introduces Contextual Image Reference, a novel capability for Vision-Language Models to appropriately reference relevant images from retrieval documents based on conversation context, and presents ImageRef-VL, a method that significantly enhances open-source VLMs' performance in this task through instruction fine-tuning.

Authors:Ziming Liu, Yizhou Liu, Eric J. Michaud, Jeff Gore, Max Tegmark
Title: Physics of Skill Learning
Abstract:
We aim to understand the physics of skill learning, i.e., how skills are learned in neural networks during training. We start by observing the Domino effect, i.e., skills are learned sequentially, and notably, some skills kick off learning right after others complete learning, similar to the sequential fall of domino cards. To understand the Domino effect and relevant behaviors of skill learning, we take a physicist's approach of abstraction and simplification. We propose three models with varying complexities -- the Geometry model, the Resource model, and the Domino model, trading between reality and simplicity. The Domino effect can be reproduced in the Geometry model, whose resource interpretation inspires the Resource model, which can be further simplified to the Domino model. These models present different levels of abstraction and simplification; each is useful to study some aspects of skill learning. The Geometry model provides interesting insights into neural scaling laws and optimizers; the Resource model sheds light on the learning dynamics of compositional tasks; the Domino model reveals the benefits of modularity. These models are not only conceptually interesting -- e.g., we show how Chinchilla scaling laws can emerge from the Geometry model, but also are useful in practice by inspiring algorithmic development -- e.g., we show how simple algorithmic changes, motivated by these toy models, can speed up the training of deep learning models.
中文: 本研究通过几何模型、资源模型和骨牌模型三个抽象模型,探讨了神经网络中技能按顺序学习的“骨牌效应”,在简洁性与现实性间取得平衡,揭示了神经缩放规律、学习动态和模块化优势,并启发了加速训练的算法改进。
English: This study explores the sequential learning of skills in neural networks, termed the Domino effect, through three abstract models—Geometry, Resource, and Domino—that balance simplicity and reality to provide insights into neural scaling laws, learning dynamics, and modularity benefits, while also inspiring practical algorithmic improvements for faster training.

Authors:Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, Min Dou, Kai Chen, Wenhai Wang, Yu Qiao, Yali Wang, Limin Wang
Title: InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Abstract:
This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM's innate abilities (focus and memory), providing new insights for future research on video MLLM. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
中文: 本文提出InternVideo2.5模型,通过长视频上下文建模结合自适应令牌压缩和密集标注技术,显著提升了视频多模态大语言模型对细节的感知能力和长时序理解能力。
English: This paper introduces InternVideo2.5, which enhances video multimodal large language models by incorporating long and rich context modeling to improve fine-grained perception and long-term temporal understanding through adaptive token compression and dense annotations.

Authors:Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
Title: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Abstract:
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-Reward
Chinese: 针对大型视觉语言模型公开多模态奖励模型稀缺的问题,我们提出了InternLM-XComposer2.5-Reward这一简单高效的多模态奖励模型,通过在多样化多模态偏好数据上训练,该模型在基准测试中表现优异,并成功应用于强化学习训练和响应选择等关键场景。
English: To address the scarcity of public multi-modal reward models for Large Vision Language Models (LVLMs), we introduce InternLM-XComposer2.5-Reward, a simple yet effective model trained on a diverse multi-modal preference corpus, which achieves top performance on benchmarks and enables key applications like RL training and response selection.

Authors:Jiacheng Zuo, Haibo Hu, Zikang Zhou, Yufei Cui, Ziquan Liu, Jianping Wang, Nan Guan, Jin Wang, Chun Jason Xue
Title: RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning
Abstract:
In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However, the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments, the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at https://github.com/JiachengZuo/RALAD.git.
中文:RALAD框架通过增强的领域适应和高效微调技术,有效弥合自动驾驶中真实与模拟环境的差距,在保持现实场景精度的同时显著提升模拟环境性能,并将重新训练成本降低约88.1%。
English: The proposed RALAD framework effectively bridges the real-to-sim gap in autonomous driving by employing enhanced domain adaptation and efficient fine-tuning, significantly improving performance in simulated environments while maintaining real-world accuracy and reducing retraining costs by 88.1%.

Authors:Wenxin Ma, Qingsong Yao, Xiang Zhang, Zhelong Huang, Zihang Jiang, S. Kevin Zhou
Title: Towards Accurate Unified Anomaly Segmentation
Abstract:
Unsupervised anomaly detection (UAD) from images strives to model normal data distributions, creating discriminative representations to distinguish and precisely localize anomalies. Despite recent advancements in the efficient and unified one-for-all scheme, challenges persist in accurately segmenting anomalies for further monitoring. Moreover, this problem is obscured by the widely-used AUROC metric under imbalanced UAD settings. This motivates us to emphasize the significance of precise segmentation of anomaly pixels using pAP and DSC as metrics. To address the unsolved segmentation task, we introduce the Unified Anomaly Segmentation (UniAS). UniAS presents a multi-level hybrid pipeline that progressively enhances normal information from coarse to fine, incorporating a novel multi-granularity gated CNN (MGG-CNN) into Transformer layers to explicitly aggregate local details from different granularities. UniAS achieves state-of-the-art anomaly segmentation performance, attaining 65.12/59.33 and 40.06/32.50 in pAP/DSC on the MVTec-AD and VisA datasets, respectively, surpassing previous methods significantly. The codes are shared at https://github.com/Mwxinnn/UniAS.
Chinese: 本文提出统一异常分割(UniAS),这是一种多层级混合框架,通过将新型多粒度门控CNN与Transformer层结合,从粗到细逐步增强正常信息,在MVTec-AD和VisA数据集上分别以65.12/59.33和40.06/32.50的pAP/DSC指标实现了最先进的异常分割性能。
English: The paper introduces Unified Anomaly Segmentation (UniAS), a multi-level hybrid pipeline that enhances normal information from coarse to fine using a novel multi-granularity gated CNN integrated with Transformer layers, achieving state-of-the-art performance in anomaly segmentation on MVTec-AD and VisA datasets with pAP/DSC scores of 65.12/59.33 and 40.06/32.50 respectively.

Authors:Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen
Title: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Abstract:
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling for synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
中文: Condor框架通过世界知识树和自我反思优化两阶段方法生成高质量合成SFT数据,仅需2万样本微调的基础模型即可超越同类模型,同时展现出可扩展的自我改进能力,为后续研究开辟了新方向。
English: The Condor framework introduces a novel two-stage approach using World Knowledge Tree and Self-Reflection Refinement to generate high-quality synthetic SFT data, enabling LLMs fine-tuned with just 20K samples to outperform counterparts while demonstrating scalable self-improvement capabilities.

Authors:Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, Jianfei Cai
Title: HAC++: Towards 100X Compression of 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a promising framework for novel view synthesis, boasting rapid rendering speed with high fidelity. However, the substantial Gaussians and their associated attributes necessitate effective compression techniques. Nevertheless, the sparse and unorganized nature of the point cloud of Gaussians (or anchors in our paper) presents challenges for compression. To achieve a compact size, we propose HAC++, which leverages the relationships between unorganized anchors and a structured hash grid, utilizing their mutual information for context modeling. Additionally, HAC++ captures intra-anchor contextual relationships to further enhance compression performance. To facilitate entropy coding, we utilize Gaussian distributions to precisely estimate the probability of each quantized attribute, where an adaptive quantization module is proposed to enable high-precision quantization of these attributes for improved fidelity restoration. Moreover, we incorporate an adaptive masking strategy to eliminate invalid Gaussians and anchors. Overall, HAC++ achieves a remarkable size reduction of over 100X compared to vanilla 3DGS when averaged on all datasets, while simultaneously improving fidelity. It also delivers more than 20X size reduction compared to Scaffold-GS. Our code is available at https://github.com/YihangChen-ee/HAC-plus.
Chinese: HAC++ 是一种针对 3D 高斯泼溅的新型压缩方法,利用锚点与哈希网格间的互信息并捕捉锚点内部关联,相比原始 3DGS 实现了超过 100 倍的体积压缩且保真度更高。
English: HAC++ is a novel compression method for 3D Gaussian Splatting that leverages mutual information between anchors and a hash grid while capturing intra-anchor relationships, achieving over 100x size reduction with improved fidelity compared to vanilla 3DGS.
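
The entropy-coding step in the abstract above, where a Gaussian distribution estimates the probability of each quantized attribute, can be illustrated with a small sketch. It is not the HAC++ implementation. It assumes the common formulation in which a quantized value's probability is the Gaussian mass over its quantization bin and the coding cost is the negative log2 of that mass. The function names and the `step` parameter are hypothetical.

```python
import math


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantized_bits(value, mu, sigma, step):
    """Quantize `value` to a grid of width `step` and estimate its coding cost.

    Assumption (not taken from the paper): the probability of the quantized
    symbol is the Gaussian mass over its bin [q - step/2, q + step/2].
    Returns the quantized value and the estimated number of bits.
    """
    q = round(value / step) * step
    p = gaussian_cdf(q + step / 2, mu, sigma) - gaussian_cdf(q - step / 2, mu, sigma)
    p = max(p, 1e-12)  # avoid log(0) when the prediction is far off
    return q, -math.log2(p)


if __name__ == "__main__":
    # A well-predicted attribute is cheap to code; a badly predicted one is expensive.
    print(quantized_bits(0.0, mu=0.0, sigma=0.1, step=0.5))
    print(quantized_bits(0.0, mu=3.0, sigma=0.1, step=0.5))
```

In this formulation, a context model that predicts `mu` and `sigma` more accurately lowers the estimated bit cost, and a larger `step` trades fidelity for a smaller size.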

Authors:Junyu Xia, Jiesong Bai, Yihang Dong
Title: DLEN: Dual Branch of Transformer for Low-Light Image Enhancement in Dual Domains
Abstract:
Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortions. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Traditional enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle with maintaining the natural appearance of enhanced images under complex lighting conditions. Although the Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhance Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms, considering both spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. Through extensive experiments, our model outperforms state-of-the-art methods on standard benchmarks. Code is available here: https://github.com/LaLaLoXX/DLEN
Chinese: 本文提出的双重光增强网络(DLEN)融合了空间与频域注意力机制,采用可学习小波变换和双分支Transformer结构,通过优化光照并保留细节来有效增强低光图像,其性能超越了现有先进方法。
English: The paper introduces the Dual Light Enhance Network (DLEN), which integrates spatial and frequency domain attention with a learnable wavelet transform and dual-branch Transformer structure to effectively enhance low-light images by improving illumination and preserving details, outperforming existing methods.

Authors:Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
Title: PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models often generate descriptions containing objects or details that are absent in the input image, a phenomenon commonly known as hallucination. Our work investigates the key reasons behind this issue by analyzing the pattern of self-attention in transformer layers. We find that hallucinations often arise from the progressive weakening of attention weight to visual tokens in the deeper layers of the LLM. Some previous works naively boost the attention of all visual tokens to mitigate this issue, resulting in suboptimal hallucination reduction. To address this, we identify two critical sets of visual tokens that facilitate the transfer of visual information from the vision encoder to the LLM. Local tokens encode grounded information about objects present in an image, while summary tokens capture the overall aggregated representation of the image. Importantly, these two sets of tokens require different levels of weight enhancement. To this end, we propose PAINT (Paying Attention to INformed Tokens), a plug-and-play framework that intervenes in the self-attention mechanism of the LLM, selectively boosting the attention weights of local and summary tokens with experimentally learned margins. Evaluation on the MSCOCO image captioning dataset demonstrates that our approach reduces hallucination rates by up to 62.3% compared to baseline models while maintaining accuracy. Code is available at https://github.com/hasanar1f/PAINT
中文:PAINT框架通过选择性增强局部和摘要视觉标记的注意力权重,将LVLM的幻觉率降低高达62.3%,同时保持准确性。
English: The PAINT framework selectively enhances attention weights for local and summary visual tokens in LVLMs to reduce hallucinations by up to 62.3% while preserving accuracy.
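
As a rough illustration of selectively boosting attention for two token sets with different margins, the sketch below adds a separate margin to the pre-softmax scores of "local" and "summary" tokens. The abstract does not say where or how PAINT applies its margins, so the additive pre-softmax form, the function names, and the index lists are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def boosted_attention(logits, local_idx, summary_idx, local_margin, summary_margin):
    """Return attention weights after boosting two token sets by different margins.

    Assumption: margins are added to the raw attention scores before softmax;
    the actual PAINT intervention may differ.
    """
    adjusted = list(logits)
    for i in local_idx:
        adjusted[i] += local_margin
    for i in summary_idx:
        adjusted[i] += summary_margin
    return softmax(adjusted)


if __name__ == "__main__":
    scores = [0.0, 0.0, 0.0, 0.0]  # four tokens with equal raw scores
    print(softmax(scores))
    print(boosted_attention(scores, local_idx=[1], summary_idx=[3],
                            local_margin=1.0, summary_margin=0.5))
```

Because the weights are renormalized, boosting the selected tokens necessarily reduces the share given to the remaining ones. This is one reason a uniform boost over all visual tokens behaves differently from a selective one.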

Authors:Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Shaoxiong Yang, Song Zhang, Yang Liu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jinbao Xue, Kai Liu, Chongqing Zhao, Xinming Wu, Zhichao Hu, Lei Qin, Jianbing Peng, Zhan Li, Minghui Chen, Xipeng Zhang, Lin Niu, Paige Wang, Yingkai Wang, Haozhao Kuang, Zhongyi Fan, Xu Zheng, Weihao Zhuang, YingPing He, Tian Liu, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Jingwei Huang, Chunchao Guo
Title: Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Abstract:
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model -- Hunyuan3D-DiT, and a large-scale texture synthesis model -- Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio -- a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models. The code and pre-trained weights of our models are available at: https://github.com/Tencent/Hunyuan3D-2
中文摘要:Hunyuan3D 2.0是一个先进的3D合成系统,包含形状生成和纹理合成两大基础模型,在几何细节与纹理质量上超越现有技术,已开源以填补大规模3D生成模型的空白。
English Summary: Hunyuan3D 2.0 is an advanced 3D synthesis system featuring a shape generation model and texture synthesis model that outperforms existing models in geometry and texture quality, released publicly to advance open-source 3D generative models.

Authors:Geonwoo Seo
Title: An End-to-End Approach for Korean Wakeword Systems with Speaker Authentication
Abstract:
Wakeword detection plays a critical role in enabling AI assistants to listen to user voices and interact effectively. However, for languages other than English, there is a significant lack of pre-trained wakeword models. Additionally, systems that merely determine the presence of a wakeword can pose serious privacy concerns. In this paper, we propose an end-to-end approach that trains wakewords for non-English languages, particularly Korean, and uses this to develop a Voice Authentication model to protect user privacy. Our implementation employs an open-source platform OpenWakeWord, which performs wakeword detection using an FCN (Fully-Connected Network) architecture. Once a wakeword is detected, our custom-developed code calculates cosine similarity for robust user authentication. Experimental results demonstrate the effectiveness of our approach, achieving Equal Error Rates (EER) of 16.79% and 6.6% in Wakeword Detection and Voice Authentication, respectively. These findings highlight the model's potential in providing secure and accurate wakeword detection and authentication for Korean users.
中文: 本文提出了一种端到端的方法,针对韩语等非英语语言训练唤醒词检测,并结合语音认证以解决隐私问题,在唤醒词检测和用户验证方面均取得了良好效果。
English: This paper introduces an end-to-end approach for training non-English wakeword detection, specifically for Korean, and integrates it with voice authentication to address privacy concerns, achieving effective performance in both wakeword detection and user verification.
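
The authentication step described above, computing cosine similarity once a wakeword is detected, can be sketched in a few lines. This is a minimal illustration, not the paper's code. The embedding vectors, the `is_enrolled_speaker` helper, and the 0.7 threshold are hypothetical. In practice the threshold would be tuned on held-out data, for example at the operating point where false accepts and false rejects are equal.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(enrolled, candidate, threshold=0.7):
    """Accept the candidate if its embedding is close enough to the enrolled one.

    The threshold value is an illustrative assumption.
    """
    return cosine_similarity(enrolled, candidate) >= threshold


if __name__ == "__main__":
    enrolled = [1.0, 0.0]
    print(is_enrolled_speaker(enrolled, [1.0, 0.1]))  # nearly the same direction
    print(is_enrolled_speaker(enrolled, [0.0, 1.0]))  # orthogonal direction
```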

Authors:Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer
Title: Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes
Abstract:
Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.
中文摘要:本研究评估了11种开源大语言模型在德国肿瘤文档自动化任务中的表现,发现70-120亿参数的模型(如Llama 3.1 8B和Mistral 7B)在性能与资源效率间达到最佳平衡,通过针对性提示策略展现出临床应用的巨大潜力。
English Summary: This study evaluates eleven open-source large language models for automating tumor documentation tasks in Germany, finding that models with 7-12 billion parameters like Llama 3.1 8B and Mistral 7B offer optimal performance-resource balance while demonstrating strong potential for clinical use through tailored prompting strategies.

Authors:Hamid Nasiri, Peter Garraghan
Title: EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition
Abstract:
Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and differences between their learning pattern and full fine-tuning. To overcome these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation (EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude and directional components. By freezing low-rank matrices, initializing them by singular value decomposition, and introducing a small trainable matrix between them, EDoRA achieves substantial reduction in trainable parameters while maintaining learning capacity. Experimental results on the GLUE benchmark demonstrate that EDoRA achieves competitive or superior performance compared to state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable parameters. This makes EDoRA a highly efficient solution for adapting LLMs to diverse tasks under memory-constrained settings. Code is available at https://github.com/Hamid-Nasiri/EDoRA .
Chinese: 提出的EDoRA方法通过分解权重并采用低秩适应,有效减少可训练参数,在性能上与LoRA等现有方法相当甚至更优,同时参数数量减少高达30倍。
English: The proposed EDoRA method efficiently reduces trainable parameters by decomposing weights and using low-rank adaptations, achieving competitive performance with up to 30x fewer parameters than existing methods like LoRA.
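To make the parameter saving concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes a single weight matrix of shape d x k, a standard LoRA update with two trainable factors of rank r, and an EDoRA-style update in which the two low-rank factors are frozen and only an r x r matrix is trained. The trainable magnitude vector with one entry per column is borrowed from DoRA and is an assumption here; the abstract does not spell it out. The function names and the 768 x 768, r = 8 example are hypothetical.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """LoRA trains both low-rank factors: B (d x r) and A (r x k)."""
    return r * (d + k)


def edora_style_trainable_params(d: int, k: int, r: int) -> int:
    """EDoRA-style count under the stated assumptions.

    The d x r and r x k factors are frozen (SVD-initialized), so only the
    small r x r matrix between them is trained. The extra k entries are an
    assumed DoRA-style magnitude vector (one scalar per column).
    """
    small_matrix = r * r
    magnitude_vector = k  # assumption, not stated in the abstract
    return small_matrix + magnitude_vector


if __name__ == "__main__":
    d, k, r = 768, 768, 8  # hypothetical layer size and rank
    lora = lora_trainable_params(d, k, r)
    edora = edora_style_trainable_params(d, k, r)
    print(f"LoRA: {lora}, EDoRA-style: {edora}, ratio: {lora / edora:.1f}x")
```

The ratio depends on the layer shape, the rank, and which components are actually trained, so this count should not be read as reproducing the 30x figure reported in the abstract.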

Authors:Liam Chalcroft, Jenny Crinion, Cathy J. Price, John Ashburner
Title: Unified 3D MRI Representations via Sequence-Invariant Contrastive Learning
Abstract:
Self-supervised deep learning has accelerated 2D natural image analysis but remains difficult to translate into 3D MRI, where data are scarce and pre-trained 2D backbones cannot capture volumetric context. We present a sequence-invariant self-supervised framework leveraging quantitative MRI (qMRI). By simulating multiple MRI contrasts from a single 3D qMRI scan and enforcing consistent representations across these contrasts, we learn anatomy-centric rather than sequence-specific features. The result is a single 3D encoder that excels across tasks and protocols. Experiments on healthy brain segmentation (IXI), stroke lesion segmentation (ARC), and MRI denoising show significant gains over baseline SSL approaches, especially in low-data settings (up to +8.3% Dice, +4.2 dB PSNR). It also generalises to unseen sites, supporting scalable clinical use. Code and trained models are publicly available at https://github.com/liamchalcroft/contrast-squared
中文: 本研究提出了一种序列不变的自监督框架,利用定量MRI模拟多种对比度,使单个3D编码器能够学习解剖学特征,在多种任务和协议中表现卓越,尤其在数据稀缺情况下优势显著。
English: This study introduces a sequence-invariant self-supervised framework that uses quantitative MRI to simulate multiple contrasts, enabling a single 3D encoder to learn anatomy-focused features and achieve superior performance across various tasks and protocols, particularly in low-data scenarios.

Authors:Jin Li, Shoujin Wang, Qi Zhang, Shui Yu, Fang Chen
Title: Generating with Fairness: A Modality-Diffused Counterfactual Framework for Incomplete Multimodal Recommendations
Abstract:
The incomplete scenario is a prevalent, practical, yet challenging setting in Multimodal Recommendations (MMRec), where some item modalities are missing due to various factors. Recently, a few efforts have sought to improve the recommendation accuracy by exploring generic structures from incomplete data. However, two significant gaps persist: 1) the difficulty in accurately generating missing data due to the limited ability to capture modality distributions; and 2) the critical but overlooked visibility bias, where items with missing modalities are more likely to be disregarded due to the prioritization of items' multimodal data over user preference alignment. This bias raises serious concerns about the fair treatment of items. To bridge these two gaps, we propose a novel Modality-Diffused Counterfactual (MoDiCF) framework for incomplete multimodal recommendations. MoDiCF features two key modules: a novel modality-diffused data completion module and a new counterfactual multimodal recommendation module. The former, equipped with a specially designed multimodal generative framework, accurately generates and iteratively refines missing data from learned modality-specific distribution spaces. The latter, grounded in the causal perspective, effectively mitigates the negative causal effects of visibility bias and thus assures fairness in recommendations. Both modules work collaboratively to address the two aforementioned gaps and generate more accurate and fair results. Extensive experiments on three real-world datasets demonstrate the superior performance of MoDiCF in terms of both recommendation accuracy and fairness. The code and processed datasets are released at https://github.com/JinLi-i/MoDiCF.
Chinese: MoDiCF框架通过模态扩散的数据补全模块精确生成缺失数据,并利用反事实推荐模块减轻可见性偏差,从而解决不完整多模态推荐问题并确保公平性。
English: The MoDiCF framework addresses incomplete multimodal recommendations by accurately generating missing data through a modality-diffused completion module and ensuring fairness via a counterfactual recommendation module that mitigates visibility bias.

Authors:Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee
Title: LASER: Lip Landmark Assisted Speaker Detection for Robustness
Abstract:
Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at https://github.com/plnguyen2908/LASER_ASD.
中文摘要:提出的LASER模型通过整合唇部关键点和辅助一致性损失,显著提升了主动说话人检测的鲁棒性,在多种数据集中尤其在视听不同步场景下表现卓越。
English Summary: The proposed LASER model enhances Active Speaker Detection by integrating lip landmarks and an auxiliary consistency loss, achieving superior robustness and performance in handling audio-visual desynchronization across diverse datasets.

Authors:Jesse Morris, Yiduo Wang, Mikolaj Kliniewski, Viorela Ila
Title: DynoSAM: Open-Source Smoothing and Mapping Framework for Dynamic SLAM
Abstract:
Traditional Visual Simultaneous Localization and Mapping (vSLAM) systems focus solely on static scene structures, overlooking dynamic elements in the environment. Although effective for accurate visual odometry in complex scenarios, these methods discard crucial information about moving objects. By incorporating this information into a Dynamic SLAM framework, the motion of dynamic entities can be estimated, enhancing navigation whilst ensuring accurate localization. However, the fundamental formulation of Dynamic SLAM remains an open challenge, with no consensus on the optimal approach for accurate motion estimation within a SLAM pipeline. Therefore, we developed DynoSAM, an open-source framework for Dynamic SLAM that enables the efficient implementation, testing, and comparison of various Dynamic SLAM optimization formulations. DynoSAM integrates static and dynamic measurements into a unified optimization problem solved using factor graphs, simultaneously estimating camera poses, static scene, object motion or poses, and object structures. We evaluate DynoSAM across diverse simulated and real-world datasets, achieving state-of-the-art motion estimation in indoor and outdoor environments, with substantial improvements over existing systems. Additionally, we demonstrate DynoSAM's utility in downstream applications, including 3D reconstruction of dynamic scenes and trajectory prediction, thereby showcasing its potential for advancing dynamic object-aware SLAM systems. DynoSAM is open-sourced at https://github.com/ACFR-RPG/DynOSAM.
中文: 传统视觉SLAM系统忽略动态元素,而DynoSAM将其整合进统一优化框架,实现了卓越的运动估计,并支持动态场景三维重建和轨迹预测等应用。
English: Traditional vSLAM systems ignore dynamic elements, but DynoSAM integrates them into a unified optimization framework, achieving superior motion estimation and enabling applications like 3D reconstruction and trajectory prediction.

Authors:Yang Wang, Haipeng Liu, Zeqian Yi, Biao Qian, Meng Wang
Title: Coarse-to-Fine Lightweight Meta-Embedding for ID-Based Recommendation
Abstract:
State-of-the-art recommendation systems have shifted attention to efficient recommendation, e.g., on-device recommendation, under memory constraints. To this end, existing methods either focus on lightweight embeddings for both users and items, or involve on-device systems that use compact embeddings to enhance reusability and reduce space complexity. However, they focus solely on the coarse granularity of embeddings while overlooking fine-grained semantic nuances, which degrades the efficacy of meta-embeddings in capturing the intricate relationships between users and items and consequently results in suboptimal recommendations. In this paper, we study how the meta-embedding can efficiently learn semantics of varied granularity, together with how the fine-grained meta-embedding can strengthen the representation of the coarse-grained meta-embedding. To answer these questions, we develop a novel graph neural network (GNN) based recommender in which each user and item serves as a node, linked directly to coarse-grained virtual nodes and indirectly to fine-grained virtual nodes, ensuring semantic learning at different granularities, while disclosing: 1) in contrast to coarse-grained semantics, fine-grained semantics are well captured through sparse meta-embeddings, which adaptively 2) balance the embedding uniqueness and memory constraint. Additionally, the initialization method builds upon SparsePCA, along with a soft thresholding activation function to induce sparsity in the meta-embeddings. We propose a weight bridging update strategy that focuses on matching each coarse-grained meta-embedding with several fine-grained meta-embeddings based on the users'/items' semantics. Extensive experiments substantiate our method's superiority over existing baselines. Our code is available at https://github.com/htyjers/C2F-MetaEmbed.
中文: 本文提出了一种基于图神经网络的推荐系统,通过元嵌入捕捉粗粒度和细粒度语义,利用稀疏表示和新颖的权重桥接策略,在内存限制下提升推荐准确性。
English: This paper introduces a graph neural network-based recommender system that captures both coarse and fine-grained semantics through meta-embeddings, using sparse representations and a novel weight bridging strategy to enhance recommendation accuracy under memory constraints.
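The abstract mentions a soft thresholding activation used to make the meta-embeddings sparse. As a hedged illustration, the standard soft-thresholding operator shrinks each value toward zero by a threshold and zeroes out anything whose magnitude is at or below it. The sketch below shows only that generic operator; the threshold value `lam` and the sample vector are hypothetical, and how the paper sets or learns the threshold is not stated in the abstract.

```python
def soft_threshold(x: float, lam: float) -> float:
    """Standard soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def sparsify(embedding: list[float], lam: float) -> list[float]:
    """Apply soft-thresholding element-wise to an embedding vector."""
    return [soft_threshold(v, lam) for v in embedding]


if __name__ == "__main__":
    vec = [-1.0, -0.2, 0.0, 0.3, 2.0]  # hypothetical embedding values
    print(sparsify(vec, lam=0.5))
```

Small entries become exactly zero, which is what makes the resulting vector cheap to store in a sparse format.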

Authors:Moslem Heidarpur, Mitra Mirhassani, Norman Chang
Title: A Fully Pipelined FIFO Based Polynomial Multiplication Hardware Architecture Based On Number Theoretic Transform
Abstract:
This paper presents digital hardware for computing polynomial multiplication using the Number Theoretic Transform (NTT), specifically designed for implementation on Field Programmable Gate Arrays (FPGAs). Multiplying two large polynomials is a core operation in many modern encryption schemes, including those based on Ring Learning with Errors (RLWE). The proposed design uses First In, First Out (FIFO) buffers to make the design fully pipelined and capable of multiplying two degree-n polynomials in n/2 clock cycles. This hardware achieves a two-fold reduction in the processing time of polynomial multiplication compared to the state of the art, enabling twice as much encryption in the same amount of time. Despite that, the proposed hardware utilizes fewer resources than the fastest reported work.
中文: 本文提出了一种基于数论变换和先进先出缓冲器的全流水线现场可编程门阵列硬件,在减少资源使用的同时将多项式乘法处理时间减半,使基于环上容错学习的加密方案吞吐量翻倍。
English: This paper introduces a fully pipelined FPGA-based hardware using NTT and FIFO buffers that halves polynomial multiplication processing time while using fewer resources, enabling double the encryption throughput for RLWE-based schemes.
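For readers unfamiliar with NTT-based multiplication, the following is a small software reference sketch of the underlying arithmetic, not the paper's FPGA architecture. It assumes the commonly used NTT-friendly prime 998244353 with primitive root 3 and computes an ordinary (zero-padded) product; RLWE schemes typically use their own modulus and a negacyclic variant modulo x^n + 1, which is not shown. All names here are illustrative.

```python
MOD = 998244353  # assumed NTT-friendly prime: 119 * 2**23 + 1
ROOT = 3         # a primitive root modulo MOD


def ntt(a: list[int], invert: bool = False) -> list[int]:
    """In-place iterative radix-2 NTT; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation so butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages of doubling length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)  # inverse root via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % MOD
                a[start + k] = (u + v) % MOD
                a[start + k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a


def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two coefficient lists (lowest degree first) modulo MOD."""
    out_len = len(p) + len(q) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fa = ntt(p + [0] * (size - len(p)))  # padded copies, inputs untouched
    fb = ntt(q + [0] * (size - len(q)))
    prod = [x * y % MOD for x, y in zip(fa, fb)]  # pointwise product
    return ntt(prod, invert=True)[:out_len]


if __name__ == "__main__":
    # (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
    print(poly_mul([1, 2, 3], [4, 5]))
```

The three transforms (two forward, one inverse) and the pointwise product are the stages that a pipelined hardware design would stream data through.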

Authors:Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun
Title: EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Abstract:
Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and not diverse enough, which do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering to assess different capabilities of the agents. We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development. We open-source all evaluation data and simulation framework at https://github.com/thunlp/EmbodiedEval.
Chinese: EmbodiedEval 是一个全面的评估基准,通过125个多样化3D场景中的328项交互任务来测试多模态大语言模型,揭示了其在具身能力方面相比人类表现存在的显著不足。
English: EmbodiedEval is a comprehensive benchmark designed to assess Multimodal Large Language Models (MLLMs) through 328 interactive tasks across 125 diverse 3D scenes, revealing their significant limitations in embodied capabilities compared to human performance.

Authors:Riqiang Gao, Mamadou Diallo, Han Liu, Anthony Magliari, Jonathan Sackett, Wilko Verbakel, Sandra Meyers, Rafe Mcbeth, Masoud Zarepisheh, Simon Arberet, Martin Kraus, Florin C. Ghesu, Ali Kamen
Title: Automating High Quality RT Planning at Scale
Abstract:
Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances with artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution for generating high-quality treatment plans. This scalable solution is designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software like Varian Eclipse. Furthermore, a novel approach for determining optimization parameters to reproduce 3D dose distributions, i.e., a method to convert dose predictions to deliverable treatment plans constrained by machine limitations, is proposed. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans in the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.
中文: AIRTP系统利用人工智能自动化放疗规划,通过可扩展的流程生成高质量且临床可比的治疗方案,同时发布大规模公共数据集以解决数据稀缺问题,推动研究发展。
English: The AIRTP system automates radiotherapy planning using AI to generate high-quality, clinically comparable treatment plans efficiently, addressing data scarcity with a scalable pipeline and releasing a large public dataset for research.

Authors:Riqiang Gao, Mamadou Diallo, Han Liu, Anthony Magliari, Jonathan Sackett, Wilko Verbakel, Sandra Meyers, Rafe Mcbeth, Masoud Zarepisheh, Simon Arberet, Martin Kraus, Florin C. Ghesu, Ali Kamen
Title: Automating RT Planning at Scale: High Quality Data For AI Training
Abstract:
Radiotherapy (RT) planning is complex, subjective, and time-intensive. Advances with artificial intelligence (AI) promise to improve its precision and efficiency, but progress is often limited by the scarcity of large, standardized datasets. To address this, we introduce the Automated Iterative RT Planning (AIRTP) system, a scalable solution for generating high-quality treatment plans. This scalable solution is designed to generate substantial volumes of consistently high-quality treatment plans, overcoming a key obstacle in the advancement of AI-driven RT planning. Our AIRTP pipeline adheres to clinical guidelines and automates essential steps, including organ-at-risk (OAR) contouring, helper structure creation, beam setup, optimization, and plan quality improvement, using AI integrated with RT planning software like Varian Eclipse. Furthermore, a novel approach for determining optimization parameters to reproduce 3D dose distributions, i.e., a method to convert dose predictions to deliverable treatment plans constrained by machine limitations, is proposed. A comparative analysis of plan quality reveals that our automated pipeline produces treatment plans of quality comparable to those generated manually, which traditionally require several hours of labor per plan. Committed to public research, the first data release of our AIRTP pipeline includes nine cohorts covering head-and-neck and lung cancer sites to support an AAPM 2025 challenge. To the best of our knowledge, this dataset features more than 10 times the number of plans in the largest existing well-curated public dataset. Repo: https://github.com/RiqiangGao/GDP-HMM_AAPMChallenge.
中文: AIRTP系统利用人工智能自动化放疗规划,通过可扩展的流程生成高质量且临床可比的治疗方案,同时发布大规模公共数据集以解决数据稀缺问题,推动研究发展。
English: The AIRTP system automates radiotherapy planning using AI to generate high-quality, clinically comparable treatment plans efficiently, addressing data scarcity with a scalable pipeline and releasing a large public dataset for research.

Authors:Pouya Hamadanian, Sadjad Fouladi
Title: Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
Abstract:
We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of the attention, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by 5.9× and reduces cost of generation by 2.8×, compared to paged attention baselines. For long sequence lengths, it achieves a 16.3× throughput improvement at 2.4× less cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-focused applications such as batch processing. The prototype is publicly available at https://github.com/microsoft/glinthawk.
中文:Glinthawk是一种离线大语言模型推理架构,通过将注意力机制与模型权重分离,实现了可扩展的键值缓存管理和高效加速器利用,从而显著提升吞吐量并降低成本。
English: Glinthawk is an offline LLM inference architecture that enhances throughput and reduces costs by separating attention mechanisms from model weights, allowing scalable key-value cache management and efficient accelerator use.
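To see why the key-value cache limits batch size, a back-of-the-envelope estimate helps. The sketch below uses the usual sizing rule for a decoder-only transformer: two tensors (keys and values) per layer, each holding one vector of size heads × head dimension per token. The model configuration is a hypothetical 7B-class setup chosen for illustration and is not taken from the Glinthawk paper.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,  # fp16/bf16
) -> int:
    """Total key-value cache size in bytes for a batch of sequences."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical 7B-class configuration: 32 layers, 32 KV heads, head_dim 128.
    per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch_size=1)
    print(f"per token: {per_token / 2**20:.2f} MiB")
    total = kv_cache_bytes(32, 32, 128, seq_len=2048, batch_size=64)
    print(f"batch of 64 x 2048 tokens: {total / 2**30:.1f} GiB")
```

Because this footprint grows with batch size and sequence length while the weights stay fixed, placing the cache on a separate tier lets the two scale independently, which is the separation the abstract describes.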

Authors:Fatemeh Nazary, Yashar Deldjoo, Tommaso di Noia
Title: Poison-RAG: Adversarial Data Poisoning Attacks on Retrieval-Augmented Generation in Recommender Systems
Abstract:
This study presents Poison-RAG, a framework for adversarial data poisoning attacks targeting retrieval-augmented generation (RAG)-based recommender systems. Poison-RAG manipulates item metadata, such as tags and descriptions, to influence recommendation outcomes. Using item metadata generated through a large language model (LLM) and embeddings derived via the OpenAI API, we explore the impact of adversarial poisoning attacks on the provider side, where attacks are designed to promote long-tail items and demote popular ones. Two attack strategies are proposed: local modifications, which personalize tags for each item using BERT embeddings, and global modifications, which apply uniform tags across the dataset. Experiments conducted on the MovieLens dataset in a black-box setting reveal that local strategies improve manipulation effectiveness by up to 50%, while global strategies risk boosting already popular items. Results indicate that popular items are more susceptible to attacks, whereas long-tail items are harder to manipulate. Approximately 70% of items lack tags, presenting a cold-start challenge; data augmentation and synthesis are proposed as potential defense mechanisms to enhance RAG-based systems' resilience. The findings emphasize the need for robust metadata management to safeguard recommendation frameworks. Code and data are available at https://github.com/atenanaz/Poison-RAG.
中文: 本研究提出Poison-RAG框架,通过对RAG推荐系统实施对抗性数据投毒攻击来操纵项目元数据以影响推荐结果,实验表明局部攻击策略能显著提升操纵效果,同时揭示了热门项目的脆弱性以及加强元数据管理的必要性。
English: This study introduces Poison-RAG, a framework for adversarial data poisoning attacks on RAG-based recommender systems that manipulates item metadata to influence recommendations, revealing through experiments that local attack strategies significantly enhance manipulation effectiveness while highlighting vulnerabilities in popular items and the need for robust metadata management.

Authors:Anwai Archit, Luca Freckmann, Constantin Pape
Title: MedicoSAM: Towards foundation models for medical image segmentation
Abstract:
Medical image segmentation is an important analysis task in clinical practice and research. Deep learning has massively advanced the field, but current approaches are mostly based on models trained for a specific task. Training such models or adapting them to a new condition is costly due to the need for (manually) labeled data. The emergence of vision foundation models, especially Segment Anything, offers a path to universal segmentation for medical images, overcoming these issues. Here, we study how to improve Segment Anything for medical images by comparing different finetuning strategies on a large and diverse dataset. We evaluate the finetuned models on a wide range of interactive and (automatic) semantic segmentation tasks. We find that the performance can be clearly improved for interactive segmentation. However, semantic segmentation does not benefit from pretraining on medical images. Our best model, MedicoSAM, is publicly available at https://github.com/computational-cell-analytics/medico-sam. We show that it is compatible with existing tools for data annotation and believe that it will be of great practical value.
中文摘要:本研究通过多种微调策略优化Segment Anything模型在医学图像上的应用,显著提升了交互式分割性能,并发布MedicoSAM作为实用的数据标注工具。
English Summary: The study enhances the Segment Anything model for medical images through diverse finetuning strategies, achieving significant improvements in interactive segmentation and releasing MedicoSAM as a practical tool for data annotation.

Authors:Saeid Asgari Taghanaki, Joao Monteiro
Title: Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
Abstract:
Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT's performance is predictive of MMLU-Pro's, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at https://github.com/asgsaeid/EQT.
中文: 大型语言模型能生成详细解释但缺乏真正理解,通过解释-查询-测试方法揭示其解释能力与推理能力之间存在差距。
English: Large language models can generate detailed explanations but lack true comprehension, as shown by the Explain-Query-Test method, which reveals a gap between their explanatory and reasoning abilities.

Authors:Jiebin Yan, Jiale Rao, Junjie Chen, Ziwen Tan, Weide Liu, Yuming Fang
Title: Multitask Auxiliary Network for Perceptual Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
Abstract:
Omnidirectional image quality assessment (OIQA) has been widely investigated in the past few years and has achieved much success. However, most existing studies are dedicated to solving the uniform distortion problem in OIQA, which differs substantially from the non-uniform distortion problem, and their ability to capture non-uniform distortion is far from satisfactory. To narrow this gap, in this paper, we propose a multitask auxiliary network for non-uniformly distorted omnidirectional images, where the parameters are optimized by jointly training the main task and other auxiliary tasks. The proposed network mainly consists of three parts: a backbone for extracting multiscale features from the viewport sequence, a multitask feature selection module for dynamically allocating specific features to different tasks, and auxiliary sub-networks for guiding the proposed model to capture local distortion and global quality change. Extensive experiments conducted on two large-scale OIQA databases demonstrate that the proposed model outperforms other state-of-the-art OIQA metrics, and that these auxiliary sub-networks contribute to improving the performance of the proposed model. The source code is available at https://github.com/RJL2000/MTAOIQA.
Chinese: 本文提出了一种多任务辅助网络,通过联合训练主任务与辅助任务来改善非均匀失真全向图像的质量评估,其性能优于现有最先进方法。
English: This paper introduces a multitask auxiliary network that enhances omnidirectional image quality assessment by addressing non-uniform distortions through joint training of main and auxiliary tasks, outperforming existing methods.

Authors:Jiebin Yan, Jiale Rao, Xuelin Liu, Yuming Fang, Yifan Zuo, Weide Liu
Title: Subjective and Objective Quality Assessment of Non-Uniformly Distorted Omnidirectional Images
Abstract:
Omnidirectional image quality assessment (OIQA) has been one of the hot topics in IQA with the continuous development of VR techniques, and has achieved much success in the past few years. However, most studies devote themselves to the uniform distortion issue, i.e., all regions of an omnidirectional image are perturbed by the "same amount" of noise, while ignoring the non-uniform distortion issue, i.e., some regions undergo a "different amount" of perturbation from the other regions in the same omnidirectional image. Additionally, nearly all OIQA models are verified on platforms containing a limited number of samples, which largely increases the over-fitting risk and therefore impedes the development of OIQA. To alleviate these issues, we explore this topic in depth from both subjective and objective perspectives. Specifically, we construct a large OIQA database containing 10,320 non-uniformly distorted omnidirectional images, each of which is generated by considering quality impairments on one or two camera lenses. Then we meticulously conduct psychophysical experiments and delve into the influence of both holistic and individual factors (i.e., distortion range and viewing condition) on omnidirectional image quality. Furthermore, we propose a perception-guided OIQA model for non-uniform distortion by adaptively simulating users' viewing behavior. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods. The source code is available at https://github.com/RJL2000/OIQAND.
中文: 本研究通过构建包含大量非均匀失真全景图像的数据集,并开发一种模拟用户观看行为的感知引导模型,有效解决了全景图像质量评估中的局限性,其性能优于现有方法。
English: This study addresses the limitations in omnidirectional image quality assessment (OIQA) by creating a large database of non-uniformly distorted images and developing a perception-guided model that outperforms existing methods by simulating user viewing behavior.

Authors:Shu Zou, Xinyu Tian, Qinyu Zhao, Zhaoyuan Yang, Jing Zhang
Title: SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models
Abstract:
Detecting out-of-distribution (OOD) data is crucial in real-world machine learning applications, particularly in safety-critical domains. Existing methods often leverage language information from vision-language models (VLMs) to enhance OOD detection by improving confidence estimation through rich class-wise text information. However, when building an OOD detection score upon in-distribution (ID) text-image affinity, existing works focus either on each ID class or on the whole ID label set, overlooking the inherent connections among ID classes. We find that the semantic information across different ID classes is beneficial for effective OOD detection. We thus investigate the ability of image-text comprehension among different semantically related ID labels in VLMs and propose a novel post-hoc strategy called SimLabel. SimLabel enhances the separability between ID and OOD samples by establishing a more robust image-class similarity metric that considers consistency over a set of similar class labels. Extensive experiments demonstrate the superior performance of SimLabel on various zero-shot OOD detection benchmarks. The proposed model is also extended to various VLM backbones, demonstrating its good generalization ability. Our demonstration and implementation codes are available at: https://github.com/ShuZou-1/SimLabel.
中文摘要:提出的SimLabel方法通过利用分布内类别之间的语义关系建立更鲁棒的图像类别相似度度量,从而提升视觉语言模型中的分布外检测性能,在多个基准测试中展现出优越表现。
English Summary: The proposed SimLabel method improves out-of-distribution detection in vision-language models by leveraging semantic relationships between in-distribution classes to create a more robust image-class similarity metric, demonstrating superior performance across multiple benchmarks.

Authors:Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Title: Curiosity-Driven Reinforcement Learning from Human Feedback
Abstract:
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
中文: 提出的CD-RLHF框架通过将好奇心驱动的内在奖励与传统RLHF相结合,在保持与人类偏好对齐的同时,显著提升了语言模型输出多样性,并在多项任务中验证了其有效性。
English: The proposed CD-RLHF framework enhances output diversity in large language models by integrating curiosity-driven intrinsic rewards with traditional RLHF, achieving improved diversity while maintaining alignment with human preferences across various tasks.
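
The reward structure described above (a dense intrinsic signal alongside a sparse extrinsic one) can be illustrated with a minimal sketch. The function name `combine_rewards`, the coefficient `eta`, and the choice to place the extrinsic reward on the final token are assumptions made for illustration; the abstract does not specify how CD-RLHF combines the two terms.

```python
def combine_rewards(extrinsic_final, intrinsic_per_token, eta=0.1):
    """Return one reward per token.

    Every token receives its intrinsic (curiosity) reward scaled by eta;
    the sparse extrinsic reward is added only on the last token.
    Both `eta` and this placement are illustrative assumptions.
    """
    if not intrinsic_per_token:
        return []
    rewards = [eta * r for r in intrinsic_per_token]
    rewards[-1] += extrinsic_final
    return rewards
```

With `eta=0` this reduces to the usual sparse RLHF reward, which makes the intrinsic term easy to ablate.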

Authors:Akash Kundu
Title: Improving thermal state preparation of Sachdev-Ye-Kitaev model with reinforcement learning on quantum hardware
Abstract:
The Sachdev-Ye-Kitaev (SYK) model, known for its strong quantum correlations and chaotic behavior, serves as a key platform for quantum gravity studies. However, variationally preparing thermal states on near-term quantum processors for large systems ($N>12$, where $N$ is the number of Majorana fermions) presents a significant challenge due to the rapid growth in the complexity of parameterized quantum circuits. This paper addresses this challenge by integrating reinforcement learning (RL) with convolutional neural networks, employing an iterative approach to optimize the quantum circuit and its parameters. The refinement process is guided by a composite reward signal derived from entropy and the expectation values of the SYK Hamiltonian. This approach reduces the number of CNOT gates by two orders of magnitude for systems $N\geq12$ compared to traditional methods like first-order Trotterization. We demonstrate the effectiveness of the RL framework in both noiseless and noisy quantum hardware environments, maintaining high accuracy in thermal state preparation. This work advances a scalable, RL-based framework with applications for quantum gravity studies and out-of-time-ordered thermal correlators computation in quantum many-body systems on near-term quantum hardware. The code is available at https://github.com/Aqasch/solving_SYK_model_with_RL.
中文: 本文提出了一种结合强化学习与卷积神经网络的方法,用于在近期量子硬件上高效制备SYK模型的热态,大幅降低了电路复杂度并保持了高精度。
English: This paper introduces a reinforcement learning framework combined with convolutional neural networks to efficiently prepare thermal states of the Sachdev-Ye-Kitaev model on near-term quantum hardware, significantly reducing circuit complexity while maintaining high accuracy.

Authors:Jing Liu, Zhenchao Ma, Zepu Wang, Chenxuanyin Zou, Jiayang Ren, Zehua Wang, Liang Song, Bo Hu, Yang Liu, Victor C. M. Leung
Title: A Survey on Diffusion Models for Anomaly Detection
Abstract:
Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we review recent advances in DMAD research. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at https://github.com/fdjingliu/DMAD.
中文摘要:扩散模型已成为跨领域异常检测的强大工具,本综述对其方法进行分类,并探讨了计算效率与模型可解释性等关键挑战。
English Summary: Diffusion models have become a powerful tool for anomaly detection across various domains, with this survey categorizing methods and exploring challenges like computational efficiency and model interpretability.

Authors:Sahar Tahmasebi, David Ernst, Eric Müller-Budack, Ralph Ewerth
Title: Verifying Cross-modal Entity Consistency in News using Vision-language Models
Abstract:
The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only a few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency (LVLM4CEC), to assess whether persons, locations and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from the web. Moreover, we extend three existing datasets for the task of entity verification in news providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, showing improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at https://github.com/TIBHannover/LVLM4CEC.
中文: 本文提出LVLM4CEC框架,利用大型视觉语言模型验证新闻中人物、地点和事件在图文模态间的一致性,通过网页抓取的参考图像提升了验证准确性,并在多个实体类型上优于基线方法。
English: This paper introduces LVLM4CEC, a framework using large vision-language models to verify the consistency of entities like persons, locations, and events across images and text in news, demonstrating improved accuracy with web-crawled reference images and outperforming baselines.

Authors:Chung-ju Huang, Yuanpeng He, Xiao Han, Wenpin Jiao, Zhi Jin, Leye Wang
Title: UniTrans: A Unified Vertical Federated Knowledge Transfer Framework for Enhancing Cross-Hospital Collaboration
Abstract:
Cross-hospital collaboration has the potential to address disparities in medical resources across different regions. However, strict privacy regulations prohibit the direct sharing of sensitive patient information between hospitals. Vertical federated learning (VFL) offers a novel privacy-preserving machine learning paradigm that maximizes data utility across multiple hospitals. Traditional VFL methods, however, primarily benefit patients with overlapping data, leaving vulnerable non-overlapping patients without guaranteed improvements in medical prediction services. While some knowledge transfer techniques can enhance the prediction performance for non-overlapping patients, they fall short in addressing scenarios where overlapping and non-overlapping patients belong to different domains, resulting in challenges such as feature heterogeneity and label heterogeneity. To address these issues, we propose a novel unified vertical federated knowledge transfer framework (Unitrans). Our framework consists of three key steps. First, we extract the federated representation of overlapping patients by employing an effective vertical federated representation learning method to model multi-party joint features online. Next, each hospital learns a local knowledge transfer module offline, enabling the transfer of knowledge from the federated representation of overlapping patients to the enriched representation of local non-overlapping patients in a domain-adaptive manner. Finally, hospitals utilize these enriched local representations to enhance performance across various downstream medical prediction tasks. Experiments on real-world medical datasets validate the framework's dual effectiveness in both intra-domain and cross-domain knowledge transfer. The code of Unitrans is available at https://github.com/Chung-ju/Unitrans.
中文: 提出的Unitrans框架通过联邦表征学习和领域自适应知识迁移,解决了垂直联邦学习中重叠与非重叠患者间的特征与标签异质性问题,并在真实医疗数据集上验证了其有效性。
English: The proposed Unitrans framework addresses feature and label heterogeneity in vertical federated learning by enabling domain-adaptive knowledge transfer from overlapping to non-overlapping patients through federated representation learning, validated on real medical datasets.

Authors:Zibin Wang, Zhiyuan Ouyang, Xiangyun Zhang
Title: Block Flow: Learning Straight Flow on Data Blocks
Abstract:
Flow-matching models provide a powerful framework for various applications, offering efficient sampling and flexible probability path modeling. These models are characterized by flows with low curvature in learned generative trajectories, which results in reduced truncation error at each sampling step. To further reduce curvature, we propose block matching. This novel approach leverages label information to partition the data distribution into blocks and match them with a prior distribution parameterized using the same label information, thereby learning straighter flows. We demonstrate that the variance of the prior distribution can control the curvature upper bound of forward trajectories in flow-matching models. By designing flexible regularization strategies to adjust this variance, we achieve optimal generation performance, effectively balancing the trade-off between maintaining diversity in generated samples and minimizing numerical solver errors. Our results demonstrate competitive performance with models of the same parameter scale. Code is available at https://github.com/wpp13749/block_flow.
中文摘要:本研究提出块匹配方法,通过将数据按标签分区并与参数化先验对齐来降低流匹配模型中的曲率,借助方差调节策略平衡生成多样性与数值误差,实现了优异性能。
English Summary: The study introduces block matching to reduce curvature in flow-matching models by partitioning data into labeled blocks and aligning them with a parameterized prior, achieving optimal generation through variance-controlled regularization.

Authors:Ruojun Xu, Weijie Xi, Xiaodi Wang, Yongbo Mao, Zach Cheng
Title: StyleSSP: Sampling StartPoint Enhancement for Training-free Diffusion-based Method for Style Transfer
Abstract:
Training-free diffusion-based methods have achieved remarkable success in style transfer, eliminating the need for extensive training or fine-tuning. However, due to the lack of targeted training for style information extraction and constraints on the content image layout, training-free methods often suffer from layout changes of the original content and content leakage from style images. Through a series of experiments, we discovered that an effective startpoint in the sampling stage significantly enhances the style transfer process. Based on this discovery, we propose StyleSSP, which focuses on obtaining a better startpoint to address layout changes of the original content and content leakage from the style image. StyleSSP comprises two key components: (1) Frequency Manipulation: To improve content preservation, we reduce the low-frequency components of the DDIM latent, allowing the sampling stage to pay more attention to the layout of content images; and (2) Negative Guidance via Inversion: To mitigate the content leakage from the style image, we employ negative guidance in the inversion stage to ensure that the startpoint of the sampling stage is distanced from the content of the style image. Experiments show that StyleSSP surpasses previous training-free style transfer baselines, particularly in preserving original content and minimizing the content leakage from the style image. Project page: https://github.com/bytedance/StyleSSP.
中文摘要:提出的StyleSSP方法通过频率操作和反向引导优化采样起点,有效解决了无训练风格迁移中的内容布局改变和风格图像内容泄露问题。
English Summary: The proposed StyleSSP method enhances training-free style transfer by improving the sampling startpoint through frequency manipulation and negative guidance, effectively preserving content layout and reducing style leakage.

Authors:Ziheng Zhang, Jianyang Gu, Arpita Chowdhury, Zheda Mai, David Carlyn, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao
Title: Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
Abstract:
Class activation map (CAM) has been widely used to highlight image regions that contribute to class predictions. Despite its simplicity and computational efficiency, CAM often struggles to identify discriminative regions that distinguish visually similar fine-grained classes. Prior efforts address this limitation by introducing more sophisticated explanation processes, but at the cost of extra complexity. In this paper, we propose Finer-CAM, a method that retains CAM's efficiency while achieving precise localization of discriminative regions. Our key insight is that the deficiency of CAM lies not in "how" it explains, but in "what" it explains. Specifically, previous methods attempt to identify all cues contributing to the target class's logit value, which inadvertently also activates regions predictive of visually similar classes. By explicitly comparing the target class with similar classes and spotting their differences, Finer-CAM suppresses features shared with other classes and emphasizes the unique, discriminative details of the target class. Finer-CAM is easy to implement, compatible with various CAM methods, and can be extended to multi-modal models for accurate localization of specific concepts. Additionally, Finer-CAM allows adjustable comparison strength, enabling users to selectively highlight coarse object contours or fine discriminative details. Quantitatively, we show that masking out the top 5% of activated pixels by Finer-CAM results in a larger relative confidence drop compared to baselines. The source code and demo are available at https://github.com/Imageomics/Finer-CAM.
Chinese: Finer-CAM通过显式比较相似类别来抑制共有特征并突出目标类的独特细节,在保持CAM效率的同时实现了对区分性区域的精确定位。
English: Finer-CAM improves upon class activation maps by efficiently highlighting unique discriminative regions through explicit comparison with similar classes, achieving precise localization without added complexity.
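
One plausible reading of "explicitly comparing the target class with similar classes" can be sketched for a linear classification head, where a class activation map is a weighted sum of feature maps. Here the map is computed from the difference between the target class weights and the weights of a similar class, scaled by a comparison strength `alpha`. The function names, the linear-head setting, and the exact form of the difference are assumptions made for illustration, not the authors' implementation.

```python
def cam(weights, feature_maps):
    """Plain CAM: sum over channels k of weights[k] * feature_maps[k][i][j]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [sum(wk * fm[i][j] for wk, fm in zip(weights, feature_maps))
         for j in range(width)]
        for i in range(height)
    ]


def difference_cam(w_target, w_similar, feature_maps, alpha=1.0):
    """CAM of the weight difference w_target - alpha * w_similar.

    Channels that both classes rely on are suppressed; alpha = 0
    recovers the plain CAM of the target class. (Illustrative sketch.)
    """
    diff = [wt - alpha * ws for wt, ws in zip(w_target, w_similar)]
    return cam(diff, feature_maps)
```

Raising `alpha` shifts the map from coarse object evidence toward the channels that separate the two classes, which mirrors the adjustable comparison strength mentioned in the abstract.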

Authors:Yepeng Liu, Zhichao Sun, Baosheng Yu, Yitian Zhao, Bo Du, Yongchao Xu, Jun Cheng
Title: MIFNet: Learning Modality-Invariant Features for Generalizable Multimodal Image Matching
Abstract:
Many keypoint detection and description methods have been proposed for image matching or registration. While these methods demonstrate promising performance for single-modality image matching, they often struggle with multimodal data because the descriptors trained on single-modality data tend to lack robustness against the non-linear variations present in multimodal data. Extending such methods to multimodal image matching often requires well-aligned multimodal data to learn modality-invariant descriptors. However, acquiring such data is often costly and impractical in many real-world scenarios. To address this challenge, we propose a modality-invariant feature learning network (MIFNet) to compute modality-invariant features for keypoint descriptions in multimodal image matching using only single-modality training data. Specifically, we propose a novel latent feature aggregation module and a cumulative hybrid aggregation module to enhance the base keypoint descriptors trained on single-modality data by leveraging pre-trained features from Stable Diffusion models. We validate our method with recent keypoint detection and description methods in three multimodal retinal image datasets (CF-FA, CF-OCT, EMA-OCTA) and two remote sensing datasets (Optical-SAR and Optical-NIR). Extensive experiments demonstrate that the proposed MIFNet is able to learn modality-invariant features for multimodal image matching without accessing the targeted modality and has good zero-shot generalization ability. The code will be released at https://github.com/lyp-deeplearning/MIFNet.
中文: 提出的MIFNet通过仅使用单模态训练数据学习模态不变特征,结合新型聚合模块和预训练Stable Diffusion模型增强基础描述符,无需对齐多模态数据即可在不同数据集中实现鲁棒的跨模态图像匹配。
English: The proposed MIFNet addresses the challenge of multimodal image matching by learning modality-invariant features using only single-modality training data, enhanced through novel aggregation modules and pre-trained Stable Diffusion features, achieving robust performance across diverse datasets without requiring aligned multimodal data.

Authors:Yanchao Wang, Dawei Zhang, Run Li, Zhonglong Zheng, Minglu Li
Title: PD-SORT: Occlusion-Robust Multi-Object Tracking Using Pseudo-Depth Cues
Abstract:
Multi-object tracking (MOT) is a rising topic in video processing technologies and has important application value in consumer electronics. Currently, tracking-by-detection (TBD) is the dominant paradigm for MOT, which performs target detection and association frame by frame. However, the association performance of TBD methods degrades in complex scenes with heavy occlusions, which hinders the application of such methods in real-world scenarios. To this end, we incorporate pseudo-depth cues to enhance the association performance and propose Pseudo-Depth SORT (PD-SORT). First, we extend the Kalman filter state vector with pseudo-depth states. Second, we introduce a novel depth volume IoU (DVIoU) by combining the conventional 2D IoU with pseudo-depth. Furthermore, we develop a quantized pseudo-depth measurement (QPDM) strategy for more robust data association. We also integrate camera motion compensation (CMC) to handle dynamic camera situations. With the above designs, PD-SORT significantly alleviates the occlusion-induced ambiguous associations and achieves leading performances on DanceTrack, MOT17, and MOT20. Note that the improvement is especially obvious on DanceTrack, where objects show complex motions, similar appearances, and frequent occlusions. The code is available at https://github.com/Wangyc2000/PD_SORT.
中文摘要:PD-SORT通过将伪深度信息融入卡尔曼滤波器并引入深度感知度量方法,有效解决了复杂场景中的遮挡问题,在多个基准数据集上实现了领先的跟踪性能。
English Summary: PD-SORT enhances multi-object tracking by incorporating pseudo-depth cues into the Kalman filter and introducing depth-aware metrics to address occlusion challenges, achieving state-of-the-art performance on benchmark datasets.

Authors:Xiangyang Hu, Xiangyu Shen, Yifei Sun, Xuhao Shan, Wenwen Min, Liyilei Su, Xiaomao Fan, Ahmed Elazab, Ruiquan Ge, Changmiao Wang, Xiaopeng Fan
Title: ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction
Abstract:
Alzheimer's disease (AD) is a common neurodegenerative disease among the elderly. Early prediction and timely intervention of its prodromal stage, mild cognitive impairment (MCI), can decrease the risk of advancing to AD. Combining information from various modalities can significantly improve predictive accuracy. However, challenges such as missing data and heterogeneity across modalities complicate multimodal learning methods as adding more modalities can worsen these issues. Current multimodal fusion techniques often fail to adapt to the complexity of medical data, hindering the ability to identify relationships between modalities. To address these challenges, we propose an innovative multimodal approach for predicting MCI conversion, focusing specifically on the issues of missing positron emission tomography (PET) data and integrating diverse medical information. The proposed incomplete triple-modal MCI conversion prediction network is tailored for this purpose. Through the missing modal generation module, we synthesize the missing PET data from the magnetic resonance imaging and extract features using specifically designed encoders. We also develop a channel aggregation module and a triple-modal co-attention fusion module to reduce feature redundancy and achieve effective multimodal data fusion. Furthermore, we design a loss function to handle missing modality issues and align cross-modal features. These components collectively harness multimodal data to boost network performance. Experimental results on the ADNI1 and ADNI2 datasets show that our method significantly surpasses existing unimodal and other multimodal models. Our code is available at https://github.com/justinhxy/ITFC.
中文摘要:本研究提出了一种创新的多模态网络,通过生成缺失的PET数据并整合多样化医疗信息来提升轻度认知障碍向阿尔茨海默病转化的预测准确性,实验证明其性能显著优于现有模型。
English Summary: This study introduces an innovative multimodal network that addresses missing PET data and integrates diverse medical information to improve the prediction of mild cognitive impairment conversion to Alzheimer's disease, demonstrating superior performance over existing models.

Authors:Ahmad Mousavi, Ramin Zandvakili
Title: $\ell_0$-Regularized Quadratic Surface Support Vector Machines
Abstract:
Kernel-free quadratic surface support vector machines have recently gained traction due to their flexibility in modeling nonlinear decision boundaries without relying on kernel functions. However, the introduction of a full quadratic classifier significantly increases the number of model parameters, scaling quadratically with data dimensionality, which often leads to overfitting and makes interpretation difficult. To address these challenges, we propose a sparse variant of the QSVM by enforcing a cardinality constraint on the model parameters. While enhancing generalization and promoting sparsity, leveraging the $\ell_0$-norm inevitably incurs additional computational complexity. To tackle this, we develop a penalty decomposition algorithm capable of producing solutions that provably satisfy the first-order Lu-Zhang optimality conditions. Our approach accommodates both hinge and quadratic loss functions. In both cases, we demonstrate that the subproblems arising within the algorithm either admit closed-form solutions or can be solved efficiently through dual formulations, which contributes to the method's overall effectiveness. We also analyze the convergence behavior of the algorithm under both loss settings. Finally, we validate our approach on several real-world datasets, demonstrating its ability to reduce overfitting while maintaining strong classification performance. The complete implementation and experimental code are publicly available at https://github.com/raminzandvakili/L0-QSVM.
中文: 作者提出一种基于ℓ₀范数约束的稀疏二次曲面支持向量机,通过惩罚分解算法有效降低过拟合并提升模型可解释性,在多个真实数据集上验证了该方法在保持分类性能的同时具有收敛性保证。
English: The authors propose a sparse quadratic surface support vector machine using an ℓ₀-norm constraint to reduce overfitting and improve interpretability, developing an efficient penalty decomposition algorithm with proven convergence that maintains strong classification performance on real datasets.
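
In penalty decomposition schemes for cardinality-constrained problems, one of the closed-form subproblems is typically the Euclidean projection onto the set of vectors with at most k nonzeros, which keeps the k largest-magnitude entries and zeros the rest. The sketch below shows only that projection; whether the paper's algorithm uses exactly this step is an assumption, and `project_cardinality` is a hypothetical name.

```python
def project_cardinality(v, k):
    """Project v onto {x : ||x||_0 <= k} in the Euclidean norm.

    Keeps the k entries of largest absolute value and sets the others
    to zero. Ties are broken in favor of the earlier index because
    Python's sort is stable.
    """
    if k <= 0:
        return [0.0] * len(v)
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]
```

For example, `project_cardinality([0.5, -3.0, 2.0, 0.1], 2)` keeps only the entries `-3.0` and `2.0`.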

Authors:Hongwei Sha, Muchen Dong, Quanyou Luo, Ming Lu, Hao Chen, Zhan Ma
Title: Towards Loss-Resilient Image Coding for Unstable Satellite Networks
Abstract:
Geostationary Earth Orbit (GEO) satellite communication demonstrates significant advantages in emergency short burst data services. However, unstable satellite networks, particularly those with frequent packet loss, present a severe challenge to accurate image transmission. To address this, we propose a loss-resilient image coding approach that leverages end-to-end optimization in learned image compression (LIC). Our method builds on the channel-wise progressive coding framework, incorporating Spatial-Channel Rearrangement (SCR) on the encoder side and Mask Conditional Aggregation (MCA) on the decoder side to improve reconstruction quality with unpredictable errors. By integrating the Gilbert-Elliott model into the training process, we enhance the model's ability to generalize in real-world network conditions. Extensive evaluations show that our approach outperforms traditional and deep learning-based methods in terms of compression performance and stability under diverse packet loss, offering robust and efficient progressive transmission even in challenging environments. Code is available at https://github.com/NJUVISION/LossResilientLIC.
中文摘要:本文针对地球静止轨道卫星通信中的图像传输问题,提出了一种抗丢损的图像编码方法,通过端到端优化结合空间-通道重排和掩码条件聚合技术,在不可预测的丢包环境下仍能保持稳定的渐进式传输性能。
English Summary: This paper introduces a loss-resilient image coding method for GEO satellite communications that uses end-to-end optimization and integrates spatial-channel rearrangement with mask conditional aggregation to maintain robust progressive transmission under unpredictable packet loss.
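
The Gilbert-Elliott channel mentioned in the abstract is a two-state Markov model of bursty packet loss: a "good" state with low loss probability and a "bad" state with high loss probability. The sketch below simulates a loss mask from such a chain. The parameter names and default values are illustrative assumptions and are not taken from the paper.

```python
import random


def gilbert_elliott_losses(n, p_gb=0.05, p_bg=0.3,
                           loss_good=0.0, loss_bad=1.0, seed=0):
    """Simulate n packets; return a list of booleans (True = packet lost).

    p_gb: probability of moving from the good state to the bad state.
    p_bg: probability of moving from the bad state back to the good state.
    loss_good / loss_bad: per-packet loss probability in each state.
    """
    rng = random.Random(seed)
    bad = False  # start in the good state
    lost = []
    for _ in range(n):
        # State transition happens before the packet is sent.
        if bad:
            if rng.random() < p_bg:
                bad = False
        elif rng.random() < p_gb:
            bad = True
        loss_prob = loss_bad if bad else loss_good
        lost.append(rng.random() < loss_prob)
    return lost
```

A mask like this could be used to drop latent channels or packets during training so that a decoder sees bursty rather than independent losses.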

Authors:Tuo Feng, Wenguan Wang, Yi Yang
Title: A Survey of World Models for Autonomous Driving
Abstract:
Recent breakthroughs in autonomous driving have been propelled by advances in robust world modeling, fundamentally transforming how vehicles interpret dynamic scenes and execute safe decision-making. World models have emerged as a linchpin technology, offering high-fidelity representations of the driving environment that integrate multi-sensor data, semantic cues, and temporal dynamics. This paper systematically reviews recent advances in world models for autonomous driving, proposing a three-tiered taxonomy: (i) Generation of Future Physical World, covering Image-, BEV-, OG-, and PC-based generation methods that enhance scene evolution modeling through diffusion models and 4D occupancy forecasting; (ii) Behavior Planning for Intelligent Agents, combining rule-driven and learning-based paradigms with cost map optimization and reinforcement learning for trajectory generation in complex traffic conditions; (iii) Interaction between Prediction and Planning, achieving multi-agent collaborative decision-making through latent space diffusion and memory-augmented architectures. The study further analyzes training paradigms, including self-supervised learning, multimodal pretraining, and generative data augmentation, while evaluating world models' performance in scene understanding and motion prediction tasks. Future research must address key challenges in self-supervised representation learning, multimodal fusion, and advanced simulation to advance the practical deployment of world models in complex urban environments. Overall, the comprehensive analysis provides a technical roadmap for harnessing the transformative potential of world models in advancing safe and reliable autonomous driving solutions.
中文: 世界建模的最新进展通过增强场景解读和决策制定,彻底改变了自动驾驶技术;本文系统综述了其分类体系、训练范式及未来挑战,为开发安全可靠的自动驾驶解决方案提供了技术路线图。
English: Recent advances in world modeling have revolutionized autonomous driving by enhancing scene interpretation and decision-making, with this paper systematically reviewing their taxonomy, training paradigms, and future challenges to guide the development of safe driving solutions.

Authors:Tal Zeevi, Lawrence H. Staib, John A. Onofrey
Title: Enhancing Uncertainty Estimation in Semantic Segmentation via Monte-Carlo Frequency Dropout
Abstract:
Monte-Carlo (MC) Dropout provides a practical solution for estimating predictive distributions in deterministic neural networks. Traditional dropout, applied within the signal space, may fail to account for frequency-related noise common in medical imaging, leading to biased predictive estimates. A novel approach extends Dropout to the frequency domain, allowing stochastic attenuation of signal frequencies during inference. This creates diverse global textural variations in feature maps while preserving structural integrity, a factor we hypothesize and empirically show contributes to accurately estimating uncertainties in semantic segmentation. We evaluated traditional MC-Dropout and MC-Frequency Dropout in three segmentation tasks involving different imaging modalities: (i) prostate zones in biparametric MRI, (ii) liver tumors in contrast-enhanced CT, and (iii) lungs in chest X-ray scans. Our results show that MC-Frequency Dropout improves calibration, convergence, and semantic uncertainty, thereby improving prediction scrutiny and boundary delineation, and has the potential to enhance medical decision-making.
中文:MC频率Dropout将dropout扩展至频域,通过增强校准、收敛和边界划分,在多种医学影像分割任务中改进了不确定性估计。
English: MC-Frequency Dropout extends dropout to the frequency domain, improving uncertainty estimation in medical image segmentation by enhancing calibration, convergence, and boundary delineation across various imaging modalities.
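
Whatever form of dropout is kept active at inference, MC estimates are usually summarized by averaging the class probabilities from several stochastic forward passes and computing the entropy of that average. The sketch below shows only this aggregation step for a single pixel; it does not implement frequency-domain dropout, and the function name is a hypothetical choice.

```python
import math


def predictive_entropy(samples):
    """Aggregate MC samples for one pixel.

    samples: list of T probability vectors (one per stochastic pass),
             each of length K (number of classes).
    Returns (mean_probs, entropy), with entropy in nats.
    """
    t = len(samples)
    k = len(samples[0])
    mean_probs = [sum(s[c] for s in samples) / t for c in range(k)]
    # Skip zero probabilities: p * log(p) tends to 0 as p tends to 0.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy
```

High entropy marks pixels where the stochastic passes disagree, which is where boundary uncertainty typically concentrates.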

Authors:Konrad Lis, Tomasz Kryjak, Marek Gorgon
Title: LiFT: Lightweight, FPGA-tailored 3D object detection based on LiDAR data
Abstract:
This paper presents LiFT, a lightweight, fully quantized 3D object detection algorithm for LiDAR data, optimized for real-time inference on FPGA platforms. Through an in-depth analysis of FPGA-specific limitations, we identify a set of FPGA-induced constraints that shape the algorithm's design. These include a computational complexity limit of 30 GMACs (billion multiply-accumulate operations), INT8 quantization for weights and activations, 2D cell-based processing instead of 3D voxels, and minimal use of skip connections. To meet these constraints while maximizing performance, LiFT combines novel mechanisms with state-of-the-art techniques such as reparameterizable convolutions and fully sparse architecture. Key innovations include the Dual-bound Pillar Feature Net, which boosts performance without increasing complexity, and an efficient scheme for INT8 quantization of input features. With a computational cost of just 20.73 GMACs, LiFT stands out as one of the few algorithms targeting minimal-complexity 3D object detection. Among comparable methods, LiFT ranks first, achieving an mAP of 51.84% and an NDS of 61.01% on the challenging NuScenes validation dataset. The code will be available at https://github.com/vision-agh/lift.
中文: 本文提出LiFT,一种轻量级全量化激光雷达3D目标检测算法,专为FPGA实时推理设计,在NuScenes数据集上以极低计算复杂度实现了最优性能。
English: This paper introduces LiFT, a lightweight and fully quantized 3D object detection algorithm for LiDAR data, designed for real-time FPGA execution with minimal computational complexity and state-of-the-art performance on the NuScenes dataset.
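
The INT8 constraint mentioned in the abstract can be illustrated with generic symmetric uniform quantization: divide by a scale, round, and clip to the signed 8-bit range. This is a textbook sketch, not the paper's input-feature quantization scheme, and the function names and the way `scale` is chosen are assumptions.

```python
def quantize_int8(values, scale):
    """Map floats to integers in [-128, 127] using a single scale factor."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(quantized, scale):
    """Approximate the original floats from their INT8 codes."""
    return [q * scale for q in quantized]
```

Values whose magnitude exceeds roughly `127 * scale` saturate, so the scale trades resolution against clipping.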

Authors:William Doherty, Anton Lee, Heitor Murilo Gomes
Title: CLOFAI: A Dataset of Real And Fake Image Classification Tasks for Continual Learning
Abstract:
The rapid advancement of generative AI models capable of creating realistic media has led to a need for classifiers that can accurately distinguish between genuine and artificially-generated images. A significant challenge for these classifiers emerges when they encounter images from generative models that are not represented in their training data, usually resulting in diminished performance. A typical approach is to periodically update the classifier's training data with images from the new generative models then retrain the classifier on the updated dataset. However, in some real-life scenarios, storage, computational, or privacy constraints render this approach impractical. Additionally, models used in security applications may be required to rapidly adapt. In these circumstances, continual learning provides a promising alternative, as the classifier can be updated without retraining on the entire dataset. In this paper, we introduce a new dataset called CLOFAI (Continual Learning On Fake and Authentic Images), which takes the form of a domain-incremental image classification problem. Moreover, we showcase the applicability of this dataset as a benchmark for evaluating continual learning methodologies. In doing this, we set a baseline on our novel dataset using three foundational continual learning methods -- EWC, GEM, and Experience Replay -- and find that EWC performs poorly, while GEM and Experience Replay show promise, performing significantly better than a Naive baseline. The dataset and code to run the experiments can be accessed from the following GitHub repository: https://github.com/Will-Doherty/CLOFAI.
Chinese: CLOFAI数据集被提出作为持续学习方法的基准,旨在解决对未见过的生成式AI模型图像分类的挑战,其中GEM和经验回放方法相较于传统方法展现出更优的性能。
English: The CLOFAI dataset is introduced as a benchmark for continual learning methods to address the challenge of classifying images from unseen generative AI models, with GEM and Experience Replay showing promising results compared to traditional approaches.
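
Of the three baselines named in the abstract, EWC has the most compact formulation: when training on a new task, it adds a quadratic penalty that discourages parameters with high Fisher information from drifting away from their previous values. The sketch below computes only that penalty term, (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2, over flat parameter lists. The function name and the default `lam` are illustrative assumptions; the paper's hyperparameters are not given in the abstract.

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """Elastic Weight Consolidation penalty.

    theta:      current parameters (flat list of floats).
    theta_star: parameters after training on the previous task.
    fisher:     diagonal Fisher information estimate, one value per parameter.
    lam:        regularization strength.
    """
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )
```

The penalty is added to the loss on the new generative model's images, so no data from earlier tasks needs to be stored.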

Authors:Dominik Kulmer, Ilir Tahiraj, Andrii Chumak, Markus Lienkamp
Title: Multi-LiCa: A Motion and Targetless Multi LiDAR-to-LiDAR Calibration Framework
Abstract:
Today's autonomous vehicles rely on a multitude of sensors to perceive their environment. To improve the perception or create redundancy, the sensors' alignment relative to each other must be known. With Multi-LiCa, we present a novel approach for the alignment, i.e., calibration. We present an automatic motion- and targetless approach for the extrinsic multi LiDAR-to-LiDAR calibration without the need for additional sensor modalities or an initial transformation input. We propose a two-step process with feature-based matching for the coarse alignment and a GICP-based fine registration in combination with a cost-based matching strategy. Our approach can be applied to any number of sensors and positions if there is a partial overlap between the field of view of single sensors. We show that our pipeline is better generalized to different sensor setups and scenarios and is on par or better in calibration accuracy than existing approaches. The presented framework is integrated in ROS 2 but can also be used as a standalone application. To build upon our work, our source code is available at: https://github.com/TUMFTM/Multi_LiCa.
中文: Multi-LiCa提出了一种新颖的无目标、无运动辅助的多激光雷达自动外参标定方法,无需初始变换或其他传感器,在通用性和精度上均优于现有方法。
English: Multi-LiCa introduces a novel, automatic motion- and targetless method for extrinsic multi-LiDAR calibration, achieving superior generalization and accuracy without requiring initial transformations or additional sensors.

Authors:Elad Levi, Ilan Kadar
Title: IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Abstract:
Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
中文摘要:IntellAgent是一个开源的多智能体框架,通过模拟真实的多策略场景并生成精细化诊断,自动创建合成基准来全面评估对话式人工智能系统。
English Summary: IntellAgent is an open-source multi-agent framework that automates the creation of synthetic benchmarks to comprehensively evaluate conversational AI systems by simulating realistic multi-policy scenarios and providing fine-grained diagnostics.

Authors:Zhipeng Yu, Qianqian Xu, Yangbangyan Jiang, Yingfei Sun, Qingming Huang
Title: Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection
Abstract:
The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selection strategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods. Code is available at https://github.com/smuelpeng/SGPS-NoiseFreeDML.
中文: 本文提出了一种名为SGPS的抗噪声深度度量学习框架,通过为噪声样本构建可靠的正样本对来提高数据利用率,在多个数据集上超越了现有方法。
English: This paper introduces a noise-robust deep metric learning framework called SGPS that constructs reliable positive pairs for noisy samples to improve data utilization, outperforming existing methods on multiple datasets.

Authors:Eunjin Kim, Hyeonjin Kim, Kyong Hwan Jin, Jaejun Yoo
Title: BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution
Abstract:
While prior methods in Continuous Spatial-Temporal Video Super-Resolution (C-STVSR) employ Implicit Neural Representation (INR) for continuous encoding, they often struggle to capture the complexity of video data, relying on simple coordinate concatenation and pre-trained optical flow networks for motion representation. Interestingly, we find that adding position encoding, contrary to common observations, does not improve--and even degrades--performance. This issue becomes particularly pronounced when combined with pre-trained optical flow networks, which can limit the model's flexibility. To address these issues, we propose BF-STVSR, a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video: 1) B-spline Mapper for smooth temporal interpolation, and 2) Fourier Mapper for capturing dominant spatial frequencies. Our approach achieves state-of-the-art in various metrics, including PSNR and SSIM, showing enhanced spatial details and natural temporal consistency. Our code is available at https://github.com/Eunjnnn/bfstvsr.
中文: 先前基于隐式神经表示的连续时空视频超分辨率方法难以充分建模视频复杂性,而我们提出的BF-STVSR框架通过B样条映射器和傅里叶映射器更好地捕捉时空特征,实现了最先进的性能表现。
English: Previous C-STVSR methods using Implicit Neural Representation often fail to adequately model video complexity, but our proposed BF-STVSR framework with B-spline and Fourier mappers achieves state-of-the-art performance by better capturing spatial and temporal characteristics.

Authors:Qi Cheems Wang, Zehao Xiao, Yixiu Mao, Yun Qu, Jiayi Shen, Yiqin Lv, Xiangyang Ji
Title: Model Predictive Task Sampling for Efficient and Robust Adaptation
Abstract:
Foundation models have revolutionized general-purpose problem-solving, offering rapid task adaptation through pretraining, meta-training, and finetuning. Recent crucial advances in these paradigms reveal the importance of challenging task prioritized sampling to enhance adaptation robustness under distribution shifts. However, ranking task difficulties over iteration as a preliminary step typically requires exhaustive task evaluation, which is practically unaffordable in computation and data-annotation. This study provides a novel perspective to illuminate the possibility of leveraging the dual importance of adaptation robustness and learning efficiency, particularly in scenarios where task evaluation is risky or costly, such as iterative agent-environment interactions for robotic policy evaluation or computationally intensive inference steps for finetuning foundation models. Firstly, we introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk landscape, providing a theoretical foundation for robust active task sampling. MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference. The resulting risk learner amortizes the costly evaluation of task adaptation performance and provably approximates task difficulty rankings. MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings. Empirically, we conduct extensive experiments in pattern recognition using foundation models and sequential decision-making. Our results demonstrate that MPTS significantly enhances adaptation robustness for tail or out-of-distribution (OOD) tasks and improves learning efficiency compared to state-of-the-art (SOTA) methods. The code is available at the project site https://github.com/thu-rllab/MPTS.
Chinese: 本研究提出了模型预测任务采样(MPTS)框架,通过预测任务难度来提升基础模型的适应鲁棒性和学习效率,无需全面评估,特别适用于处理尾部或分布外任务。
English: This study introduces Model Predictive Task Sampling (MPTS), a framework that efficiently predicts task difficulty and enhances adaptation robustness and learning efficiency for foundation models without exhaustive evaluations, particularly benefiting tail and out-of-distribution tasks.

Authors:Sani Abdullahi Sani, Shamsuddeen Hassan Muhammad, Devon Jarvis
Title: Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa
Abstract:
Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text. Although significant advances have been made in SA for widely spoken languages, low-resource languages such as Hausa face unique challenges, primarily due to a lack of digital resources. This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa. We first curate a diverse, unlabeled corpus to expand the model's linguistic capabilities, followed by applying LAFT to adapt AfriBERTa specifically to the nuances of the Hausa language. The adapted model is then fine-tuned on the labeled NaijaSenti sentiment dataset to evaluate its performance. Our findings demonstrate that LAFT gives modest improvements, which may be attributed to the use of formal Hausa text rather than informal social media data. Nevertheless, the pre-trained AfriBERTa model significantly outperformed models not specifically trained on Hausa, highlighting the importance of using pre-trained models in low-resource contexts. This research emphasizes the necessity for diverse data sources to advance NLP applications for low-resource African languages. We published the code and the dataset to encourage further research and facilitate reproducibility in low-resource NLP here: https://github.com/Sani-Abdullahi-Sani/Natural-Language-Processing/blob/main/Sentiment%20Analysis%20for%20Low%20Resource%20African%20Languages
中文: 本研究采用语言自适应微调技术提升豪萨语情感分析效果,在正式文本中取得适度改进,同时证明AfriBERTa模型在低资源语言环境中的优越性,并强调非洲语言自然语言处理需要多样化数据支持。
English: This study applies Language-Adaptive Fine-Tuning to enhance sentiment analysis for Hausa, showing modest gains with formal texts while demonstrating AfriBERTa's superiority in low-resource settings and emphasizing the need for diverse data in African language NLP.

Authors:Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, Kareem Ashraf Elozeiri, Saad El Dine Ahmed El Etter, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Nurkhan Laiyk, Osama Mohammed Afzal, Ryuto Koike, Masahiro Kaneko, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Title: GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Abstract:
We present the GenAI Content Detection Task 1 -- a shared task on binary machine-generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 26 teams -- to the Multilingual. We provide a comprehensive overview of the data, a summary of the results -- including system rankings and performance scores -- detailed descriptions of the participating systems, and an in-depth analysis of submissions. https://github.com/mbzuai-nlp/COLING-2025-Workshop-on-MGT-Detection-Task1
中文: COLING 2025的GenAI内容检测任务1聚焦于二进制机器生成文本检测,包含单语和多语言子任务,分别吸引了36支和26支团队参与,并提供了数据概览、结果总结、系统描述及提交内容的深入分析。
English: The GenAI Content Detection Task 1 at COLING 2025 involved binary machine-generated text detection with Monolingual and Multilingual subtasks, attracting 36 and 26 teams respectively, and included data overviews, results, system descriptions, and submission analyses.

Authors:Haichao Wei, Yunxiang Ren, Zhoutong Fu, Aman Lunia, Yi-Lin Chen, Alice Leung, Ya Xu
Title: Control LLM: Controlled Evolution for Intelligence Retention in LLM
Abstract:
Large Language Models (LLMs) demand significant computational resources, making it essential to enhance their capabilities without retraining from scratch. A key challenge in this domain is \textit{catastrophic forgetting} (CF), which hampers performance during Continuous Pre-training (CPT) and Continuous Supervised Fine-Tuning (CSFT). We propose \textbf{Control LLM}, a novel approach that leverages parallel pre-trained and expanded transformer blocks, aligning their hidden-states through interpolation strategies. This method effectively preserves performance on existing tasks while seamlessly integrating new knowledge. Extensive experiments demonstrate the effectiveness of Control LLM in both CPT and CSFT. On Llama3.1-8B-Instruct, it achieves significant improvements in mathematical reasoning ($+14.4\%$ on Math-Hard) and coding performance ($+10\%$ on MBPP-PLUS). On Llama3.1-8B, it enhances multilingual capabilities ($+10.6\%$ on C-Eval, $+6.8\%$ on CMMLU, and $+30.2\%$ on CMMLU-0shot-CoT). It surpasses existing methods and achieves SOTA among open-source models tuned from the same base model, using substantially less data and compute. Crucially, these gains are realized while preserving strong original capabilities, with minimal degradation ($<4.3\%$ on MMLU) compared to $>35\%$ in open-source Math and Coding models. This approach has been successfully deployed in LinkedIn's GenAI-powered job seeker and Ads unit products. To support further research, we release the training and evaluation code (https://github.com/linkedin/ControlLLM) along with models trained on public datasets (https://huggingface.co/ControlLLM) to the community.
中文: Control LLM通过并行Transformer模块和插值策略的新方法,在持续训练中有效防止灾难性遗忘,在数学推理、编程和多语言任务上实现最优性能,同时保持原有能力且性能下降极小。
English: Control LLM is a novel method that uses parallel transformer blocks and interpolation to prevent catastrophic forgetting during continuous training, achieving state-of-the-art performance in mathematical reasoning, coding, and multilingual tasks while preserving original capabilities with minimal degradation.

Authors:Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou
Title: Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Abstract:
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.
中文: 提出的金字塔式视觉位置编码(PyPE)通过从外围到中心的视觉位置索引和逐步扩展中心感受野,增强了视觉语言模型的多粒度感知能力,有效提升了不同规模模型的综合性能。
English: The proposed Pyramid-descent Visual Position Encoding (PyPE) enhances vision-language models by rationally encoding visual positions to improve multi-granularity perception and attention allocation, consistently boosting model performance across various scales.

Authors:Weiyu Chen, Baijiong Lin, Xiaoyuan Zhang, Xi Lin, Han Zhao, Qingfu Zhang, James T. Kwok
Title: Gradient-Based Multi-Objective Deep Learning: Algorithms, Theories, Applications, and Beyond
Abstract:
Many modern deep learning applications require balancing multiple objectives that are often conflicting. Examples include multi-task learning, fairness-aware learning, and the alignment of Large Language Models (LLMs). This leads to multi-objective deep learning, which tries to find optimal trade-offs or Pareto-optimal solutions by adapting mathematical principles from the field of Multi-Objective Optimization (MOO). However, directly applying gradient-based MOO techniques to deep neural networks presents unique challenges, including high computational costs, optimization instability, and the difficulty of effectively incorporating user preferences. This paper provides a comprehensive survey of gradient-based techniques for multi-objective deep learning. We systematically categorize existing algorithms based on their outputs: (i) methods that find a single, well-balanced solution, (ii) methods that generate a finite set of diverse Pareto-optimal solutions, and (iii) methods that learn a continuous Pareto set of solutions. In addition to this taxonomy, the survey covers theoretical analyses, key applications, practical resources, and highlights open challenges and promising directions for future research. A comprehensive list of multi-objective deep learning algorithms is available at https://github.com/Baijiong-Lin/Awesome-Multi-Objective-Deep-Learning.
中文摘要:本文系统综述了基于梯度的多目标深度学习优化方法,按解决方案类型分类并探讨了计算成本高、优化不稳定等关键挑战。
English Summary: This paper surveys gradient-based multi-objective optimization techniques for deep learning, categorizing them by solution types and addressing challenges like computational costs and optimization instability.
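
To make the notion of Pareto-optimal solutions used in this entry concrete, the following is a minimal sketch of Pareto dominance and a brute-force Pareto front filter, assuming every objective is minimized. The function names and the tuple-of-objectives representation are illustrative assumptions; this is a textbook definition, not one of the gradient-based algorithms the survey covers.

```python
def dominates(a, b):
    """Return True if point a Pareto-dominates point b (all objectives minimized).

    a dominates b when a is no worse on every objective and strictly
    better on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates (O(n^2) comparison)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Each tuple is (objective_1, objective_2), e.g. two task losses.
    candidates = [(1, 5), (2, 2), (3, 3), (5, 1), (4, 4)]
    print(pareto_front(candidates))
```

In the small example, `(3, 3)` and `(4, 4)` are both dominated by `(2, 2)`, while the remaining three points represent different trade-offs between the two objectives.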

Authors:Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen
Title: InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
Abstract:
The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering tasks. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at https://github.com/HaileyFamo/InsQABench.git.
Chinese Summary: 本研究针对中文保险领域推出InsQABench评估基准,通过实验证明虽然大语言模型在专业术语理解上存在困难,但基于该数据集的微调能显著提升模型在保险问答任务中的表现。
English Summary: This study introduces InsQABench, a specialized benchmark for evaluating large language models in the Chinese insurance industry, and demonstrates that fine-tuning with this dataset significantly enhances model performance despite initial challenges with domain-specific terminology.

Authors:Sijun Dong, Fangcheng Zuo, Geng Chen, Siming Fu, Xiaoliang Meng
Title: A Remote Sensing Image Change Detection Method Integrating Layer Exchange and Channel-Spatial Differences
Abstract:
Change detection in remote sensing imagery is a critical technique for Earth observation, primarily focusing on pixel-level segmentation of change regions between bi-temporal images. The essence of pixel-level change detection lies in determining whether corresponding pixels in bi-temporal images have changed. In deep learning, the spatial and channel dimensions of feature maps represent different information from the original images. In this study, we found that in change detection tasks, difference information can be computed not only from the spatial dimension of bi-temporal features but also from the channel dimension. Therefore, we designed the Channel-Spatial Difference Weighting (CSDW) module as an aggregation-distribution mechanism for bi-temporal features in change detection. This module enhances the sensitivity of the change detection model to difference features. Additionally, bi-temporal images share the same geographic location and exhibit strong inter-image correlations. To construct the correlation between bi-temporal images, we designed a decoding structure based on the Layer-Exchange (LE) method to enhance the interaction of bi-temporal features. Comprehensive experiments on the CLCD, PX-CLCD, LEVIR-CD, and S2Looking datasets demonstrate that the proposed LENet model significantly improves change detection performance. The code and pre-trained models will be available at: https://github.com/dyzy41/lenet.
中文摘要:本研究提出通道-空间差异加权模块和层交换解码结构,通过增强双时相特征交互来改进遥感变化检测,所开发的LENet模型在多个数据集上展现出卓越性能。
English Summary: This study introduces the Channel-Spatial Difference Weighting (CSDW) module and a Layer-Exchange (LE) decoding structure to enhance bi-temporal feature interaction in remote sensing change detection, with the proposed LENet model demonstrating superior performance across multiple datasets.

Authors:Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, Harsha Nori
Title: JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models
Abstract:
Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technology across sectors for enforcing structured outputs during generation. Despite its growing adoption, little has been done with the systematic evaluation of the behaviors and performance of constrained decoding. Constrained decoding frameworks have standardized around JSON Schema as a structured data format, with most uses guaranteeing constraint compliance given a schema. However, there is poor understanding of the effectiveness of the methods in practice. We present an evaluation framework to assess constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs. To facilitate this evaluation, we introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity. We pair the benchmark with the existing official JSON Schema Test Suite and evaluate six state-of-the-art constrained decoding frameworks, including Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini. Through extensive experiments, we gain insights into the capabilities and limitations of constrained decoding on structured generation with real-world JSON schemas. Our work provides actionable insights for improving constrained decoding frameworks and structured generation tasks, setting a new standard for evaluating constrained decoding and structured generation. We release JSONSchemaBench at https://github.com/guidance-ai/jsonschemabench
Chinese: 本文提出了一个评估框架和JSONSchemaBench基准,通过测试六种先进约束解码框架,系统评估了语言模型结构化输出的约束解码方法,揭示了其在效率、约束覆盖和输出质量方面的重要特性。
English: This paper introduces a comprehensive evaluation framework and JSONSchemaBench benchmark to systematically assess constrained decoding methods for structured language model outputs, revealing key insights about their efficiency, constraint coverage, and output quality through testing six state-of-the-art frameworks.

Authors:Young Seok Jeon, Hongfei Yang, Huazhu Fu, Mengling Feng
Title: No More Sliding Window: Efficient 3D Medical Image Segmentation with Differentiable Top-k Patch Sampling
Abstract:
3D models surpass 2D models in CT/MRI segmentation by effectively capturing inter-slice relationships. However, the added depth dimension substantially increases memory consumption. While patch-based training alleviates memory constraints, it significantly slows down the inference speed due to the sliding window (SW) approach. We propose No-More-Sliding-Window (NMSW), a novel end-to-end trainable framework that enhances the efficiency of generic 3D segmentation backbone during an inference step by eliminating the need for SW. NMSW employs a differentiable Top-k module to selectively sample only the most relevant patches, thereby minimizing redundant computations. When patch-level predictions are insufficient, the framework intelligently leverages coarse global predictions to refine results. Evaluated across 3 tasks using 3 segmentation backbones, NMSW achieves competitive accuracy compared to SW inference while significantly reducing computational complexity by 91% (88.0 to 8.00 TMACs). Moreover, it delivers a 9.1x faster inference on the H100 GPU (99.0 to 8.3 sec) and an 11.1x faster inference on the Xeon Gold CPU (2110 to 189 sec). NMSW is model-agnostic, further boosting efficiency when integrated with any existing efficient segmentation backbones. The code is available at https://github.com/Youngseok0001/open_nmsw.
中文摘要:提出的NMSW框架通过选择性采样关键切片并融合全局预测,在保持精度的同时消除了3D医学图像分割中的滑动窗口操作,实现了91%的计算量削减和超过9倍的推理加速。
English Summary: The proposed NMSW framework eliminates sliding window inference in 3D medical image segmentation by selectively sampling relevant patches and leveraging global predictions, achieving 91% computational reduction and over 9x faster inference while maintaining competitive accuracy.

Authors:Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, Fan Zhou
Title: LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection
Abstract:
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query. Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information. However, existing methods face several issues: (1) Overlapping semantic information between different samples in the dataset hinders the model's multimodal aligning performance; (2) Existing models are not able to efficiently extract local features of the video; (3) The Transformer Decoder used by the existing model cannot adequately decode multimodal features. To address the above issues, we proposed the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks. Specifically, we first distilled the similarity matrix into the identity matrix to mitigate the impact of overlapping semantic information. Then, we designed a method that enables convolutional layers to extract multimodal local features more efficiently. Finally, we fed the output of the Transformer Decoder back into itself to adequately decode multimodal information. We evaluated LD-DETR on four public benchmarks and conducted extensive experiments to demonstrate the superiority and effectiveness of our approach. Our model outperforms the State-Of-The-Art models on QVHighlight, Charades-STA and TACoS datasets. Our code is available at https://github.com/qingchen239/ld-detr.
Chinese: LD-DETR模型通过将相似度矩阵蒸馏为恒等矩阵缓解语义重叠问题,利用卷积层高效提取局部特征,并通过Transformer解码器自反馈充分解码多模态信息,在多个公开数据集上实现了最先进的视频片段检索和高亮检测性能。
English: The LD-DETR model addresses key challenges in video moment retrieval and highlight detection by mitigating overlapping semantic information through similarity matrix distillation, enhancing local feature extraction with convolutional layers, and improving multimodal decoding via Transformer Decoder feedback, achieving state-of-the-art performance on multiple benchmarks.

Authors:Xinjie Liang, Xiangyu Li, Fanding Li, Jie Jiang, Qing Dong, Wei Wang, Kuanquan Wang, Suyu Dong, Gongning Luo, Shuo Li
Title: MedFILIP: Medical Fine-grained Language-Image Pre-training
Abstract:
Medical vision-language pretraining (VLP) that leverages naturally-paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model is proposed to decouple comprehensive disease details from reports, which excels in extracting disease details through flexible prompt engineering, thereby effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector is proposed to construct relationships between categories and visual attributes, which helps the model to make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations is proposed, providing smoother, information-richer labels, thus allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, with classification accuracy increasing by a maximum of 6.69\%. The code is available at https://github.com/PerceptionComputingLab/MedFILIP.
Chinese: MedFILIP是一种细粒度视觉语言预训练模型,通过对比学习增强医学图像与疾病的关联性,在多个数据集上实现最高6.69%准确率提升的领先性能。
English: MedFILIP is a fine-grained vision-language pretraining model that enhances medical image-disease associations through contrastive learning, achieving state-of-the-art performance across multiple datasets with up to 6.69% accuracy improvement.

Authors:Jinyuan Liu, Guanyao Wu, Zhu Liu, Di Wang, Zhiying Jiang, Long Ma, Wei Zhong, Xin Fan, Risheng Liu
Title: Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption
Abstract:
Infrared-visible image fusion (IVIF) is a critical task in computer vision, aimed at integrating the unique features of both infrared and visible spectra into a unified representation. Since 2018, the field has entered the deep learning era, with an increasing variety of approaches introducing a range of networks and loss functions to enhance visual performance. However, challenges such as data compatibility, perception accuracy, and efficiency remain. Unfortunately, there is a lack of recent comprehensive surveys that address this rapidly expanding domain. This paper fills that gap by providing a thorough survey covering a broad range of topics. We introduce a multi-dimensional framework to elucidate common learning-based IVIF methods, from visual enhancement strategies to data compatibility and task adaptability. We also present a detailed analysis of these approaches, accompanied by a lookup table clarifying their core ideas. Furthermore, we summarize performance comparisons, both quantitatively and qualitatively, focusing on registration, fusion, and subsequent high-level tasks. Beyond technical analysis, we discuss potential future directions and open issues in this area. For further details, visit our GitHub repository: https://github.com/RollingPlain/IVIF_ZOO.
中文摘要:本文对红外-可见光图像融合领域进行了全面综述,系统分析了深度学习方法、性能比较及未来研究方向。
English Summary: This paper provides a comprehensive survey of infrared-visible image fusion, analyzing deep learning methods, performance comparisons, and future directions in the field.

Authors:Yaniv Shulman
Title: Robust Local Polynomial Regression with Similarity Kernels
Abstract:
Local Polynomial Regression (LPR) is a widely used nonparametric method for modeling complex relationships due to its flexibility and simplicity. It estimates a regression function by fitting low-degree polynomials to localized subsets of the data, weighted by proximity. However, traditional LPR is sensitive to outliers and high-leverage points, which can significantly affect estimation accuracy. This paper revisits the kernel function used to compute regression weights and proposes a novel framework that incorporates both predictor and response variables in the weighting mechanism. The focus of this work is a conditional density kernel that robustly estimates weights by mitigating the influence of outliers through localized density estimation. A related joint density kernel is also discussed in an appendix. The proposed method is implemented in Python and is publicly available at https://github.com/yaniv-shulman/rsklpr, demonstrating competitive performance in synthetic benchmark experiments. Compared to standard LPR, the proposed approach consistently improves robustness and accuracy, especially in heteroscedastic and noisy environments, without requiring multiple iterations. This advancement provides a promising extension to traditional LPR, opening new possibilities for robust regression applications.
中文: 本文提出了局部多项式回归的鲁棒扩展方法,通过引入条件密度核函数进行局部密度估计来降低异常值敏感性,在噪声环境中无需迭代即可提升估计精度。
English: This paper introduces a robust extension to Local Polynomial Regression by incorporating a conditional density kernel that mitigates outlier sensitivity through localized density estimation, improving accuracy in noisy environments without iterative steps.
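
For orientation, the baseline that the abstract describes (fitting a low-degree polynomial around each query point, weighted by proximity) can be sketched as below. This is standard local linear regression with a Gaussian kernel on the predictor only; it is not the paper's conditional density kernel, and the function name `local_linear_predict` and the `bandwidth` parameter are illustrative assumptions.

```python
import numpy as np


def local_linear_predict(x_train, y_train, x0, bandwidth):
    """Estimate the regression function at x0 with a locally weighted line.

    Assumption: a Gaussian kernel on |x - x0| supplies the weights. The paper's
    robust kernel, which also uses the response variable, is NOT shown here.
    """
    # Proximity weights: points near x0 count more than distant ones.
    w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
    # Degree-1 design matrix in (x - x0); the fitted intercept is the estimate at x0.
    A = np.column_stack([np.ones_like(x_train), x_train - x0])
    # Weighted least squares via sqrt-weight scaling of rows.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y_train * sw, rcond=None)
    return float(beta[0])
```

Because the weights depend only on the predictor, a single outlying response near `x0` pulls the fit toward it, which is the sensitivity the paper sets out to reduce.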

Authors:Weihang Zhang, Jihao Li, Shuoke Li, Ziqing Niu, Jialiang Chen, Wenkai Zhang
Title: A Resource-Efficient Training Framework for Remote Sensing Text--Image Retrieval
Abstract:
Remote sensing text--image retrieval (RSTIR) aims to retrieve the matched remote sensing (RS) images from the database according to the descriptive text. Recently, the rapid development of large visual-language pre-training models provides new insights for RSTIR. Nevertheless, as the complexity of models grows in RSTIR, the previous studies suffer from suboptimal resource efficiency during transfer learning. To address this issue, we propose a computation and memory-efficient retrieval (CMER) framework for RSTIR. To reduce the training memory consumption, we propose the Focus-Adapter module, which adopts a side branch structure. Its focus layer suppresses the interference of background pixels for small targets. Simultaneously, to enhance data efficacy, we regard the RS scene category as the metadata and design a concise augmentation technique. The scene label augmentation leverages the prior knowledge from land cover categories and shrinks the search space. We propose the negative sample recycling strategy to make the negative sample pool decoupled from the mini-batch size. It improves the generalization performance without introducing additional encoders. We have conducted quantitative and qualitative experiments on public datasets and expanded the benchmark with some advanced approaches, which demonstrates the competitiveness of the proposed CMER. Compared with the recent advanced methods, the overall retrieval performance of CMER is 2%--5% higher on RSITMD. Moreover, our proposed method reduces memory consumption by 49% and has a 1.4x data throughput during training. The code of the CMER and the dataset will be released at https://github.com/ZhangWeihang99/CMER.
中文: 本研究提出了一种用于遥感图文检索的计算与内存高效检索(CMER)框架,通过创新模块和策略,在提升检索性能2%–5%的同时,将内存消耗降低49%并实现1.4倍的数据处理效率提升。
English: The study introduces a computation and memory-efficient retrieval (CMER) framework for remote sensing text-image retrieval, which enhances retrieval performance by 2%–5%, reduces memory usage by 49%, and increases data throughput by 1.4 times through innovative modules and strategies.

Authors:Mehrad Mortazavi, David J. Cappelleri, Reza Ehsani
Title: RoMu4o: A Robotic Manipulation Unit For Orchard Operations Automating Proximal Hyperspectral Leaf Sensing
Abstract:
Driven by the need to address labor shortages and meet the demands of a rapidly growing population, robotic automation has become a critical component in precision agriculture. Leaf-level hyperspectral spectroscopy is shown to be a powerful tool for phenotyping, monitoring crop health, identifying essential nutrients within plants as well as detecting diseases and water stress. This work introduces RoMu4o, a robotic manipulation unit for orchard operations offering an automated solution for proximal hyperspectral leaf sensing. This ground robot is equipped with a 6DOF robotic arm and vision system for real-time deep learning-based image processing and motion planning. We developed robust perception and manipulation pipelines that enable the robot to successfully grasp target leaves and perform spectroscopy. These frameworks operate synergistically to identify and extract the 3D structure of leaves from an observed batch of foliage, propose 6D poses, and generate collision-free constraint-aware paths for precise leaf manipulation. The end-effector of the arm features a compact design that integrates an independent lighting source with a hyperspectral sensor, enabling high-fidelity data acquisition while streamlining the calibration process for accurate measurements. Our ground robot is engineered to operate in unstructured orchard environments. However, the performance of the system is evaluated in both indoor and outdoor plant models. The system demonstrated reliable performance for 1-LPB hyperspectral sampling, achieving 95% success rate in lab trials and 79% in field trials. Field experiments revealed an overall success rate of 70% for autonomous leaf grasping and hyperspectral measurement in a pistachio orchard. The open-source repository is available at: https://github.com/mehradmrt/UCM-AgBot-ROS2
中文: 为解决劳动力短缺和满足人口增长需求,本研究开发了RoMu4o果园机器人系统,通过先进的视觉与操控技术实现叶片抓取和高光谱检测,在实验室和田间试验中均展现出较高的成功率。
English: To tackle labor shortages and support growing populations, this research presents RoMu4o, a robotic system for orchard automation that uses advanced vision and manipulation to grasp leaves and perform hyperspectral sensing with high success rates in both lab and field conditions.

Authors:Delin An, Pan Du, Pengfei Gu, Jian-Xun Wang, Chaoli Wang
Title: Hierarchical LoG Bayesian Neural Network for Enhanced Aorta Segmentation
Abstract:
Accurate segmentation of the aorta and its associated arch branches is crucial for diagnosing aortic diseases. While deep learning techniques have significantly improved aorta segmentation, they remain challenging due to the intricate multiscale structure and the complexity of the surrounding tissues. This paper presents a novel approach for enhancing aorta segmentation using a Bayesian neural network-based hierarchical Laplacian of Gaussian (LoG) model. Our model consists of a 3D U-Net stream and a hierarchical LoG stream: the former provides an initial aorta segmentation, and the latter enhances blood vessel detection across varying scales by learning suitable LoG kernels, enabling self-adaptive handling of different parts of the aorta vessels with significant scale differences. We employ a Bayesian method to parameterize the LoG stream and provide confidence intervals for the segmentation results, ensuring robustness and reliability of the prediction for vascular medical image analysts. Experimental results show that our model can accurately segment main and supra-aortic vessels, yielding at least a 3% gain in the Dice coefficient over state-of-the-art methods across multiple volumes drawn from two aorta datasets, and can provide reliable confidence intervals for different parts of the aorta. The code is available at https://github.com/adlsn/LoGBNet.
中文: 本文提出一种基于贝叶斯神经网络的分层拉普拉斯高斯模型,通过结合三维U-Net与自适应多尺度血管检测来提升主动脉分割精度,在多个数据集上实现Dice系数至少3%的提升,并能为不同血管部位提供可靠置信区间。
English: This paper introduces a Bayesian neural network-based hierarchical Laplacian of Gaussian model that improves aorta segmentation by combining 3D U-Net with adaptive multi-scale vessel detection, achieving at least 3% higher Dice scores and providing reliable confidence intervals.

Authors:Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, Kaan Ozbay
Title: When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
Abstract:
The increasing availability of traffic videos functioning on a 24/7/365 time scale has the great potential of increasing the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at https://github.com/ai4ce/SeeUnsafe.
中文: SeeUnsafe框架通过多模态大语言模型代理,将传统交通事故视频分析转变为交互式对话处理,利用基于严重程度的聚合策略自动完成分类和视觉定位任务,显著提升了处理效率与场景适应性。
English: The SeeUnsafe framework utilizes Multimodal Large Language Model agents to revolutionize traffic accident analysis by enabling interactive, conversational processing of surveillance videos, automating classification and visual grounding tasks while adapting to diverse scenarios through a severity-based aggregation strategy.

Authors:Andrey Risukhin, Kavel Rao, Ben Caffee, Alan Fan
Title: ColorGrid: A Multi-Agent Non-Stationary Environment for Goal Inference and Assistance
Abstract:
Autonomous agents' interactions with humans are increasingly focused on adapting to their changing preferences in order to improve assistance in real-world tasks. Effective agents must learn to accurately infer human goals, which are often hidden, to collaborate well. However, existing Multi-Agent Reinforcement Learning (MARL) environments lack the necessary attributes required to rigorously evaluate these agents' learning capabilities. To this end, we introduce ColorGrid, a novel MARL environment with customizable non-stationarity, asymmetry, and reward structure. We investigate the performance of Independent Proximal Policy Optimization (IPPO), a state-of-the-art (SOTA) MARL algorithm, in ColorGrid and find through extensive ablations that, particularly with simultaneous non-stationary and asymmetric goals between a "leader" agent representing a human and a "follower" assistant agent, ColorGrid is unsolved by IPPO. To support benchmarking future MARL algorithms, we release our environment code, model checkpoints, and trajectory visualizations at https://github.com/andreyrisukhin/ColorGrid.
中文摘要:本文提出ColorGrid这一新型多智能体强化学习环境,旨在评估智能体适应非平稳和不对称人类偏好的能力,实验表明当前最先进的IPPO算法无法解决这些复杂场景。
English Summary: The paper introduces ColorGrid, a novel Multi-Agent Reinforcement Learning environment designed to evaluate agents' ability to adapt to non-stationary and asymmetric human preferences, where current state-of-the-art algorithms like IPPO fail to solve these complex scenarios.

Authors:Taehee Jeong
Title: 4bit-Quantization in Vector-Embedding for RAG
Abstract:
Retrieval-augmented generation (RAG) is a promising technique that has shown great potential in addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: they can contain outdated information due to their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucinations. RAG aims to mitigate these issues by leveraging a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, one of the challenges of using high-dimensional embeddings is that they require a significant amount of memory to store. This can be a major issue, especially when dealing with large databases of documents. To alleviate this problem, we propose the use of 4-bit quantization to store the embedding vectors. This involves reducing the precision of the vectors from 32-bit floating-point numbers to 4-bit integers, which can significantly reduce the memory requirements. Our approach has several benefits. Firstly, it significantly reduces the memory storage requirements of the high-dimensional vector database, making it more feasible to deploy RAG systems in resource-constrained environments. Secondly, it speeds up the searching process, as the reduced precision of the vectors allows for faster computation. Our code is available at https://github.com/taeheej/4bit-Quantization-in-Vector-Embedding-for-RAG
中文: 本文提出在检索增强生成系统中使用4位量化存储嵌入向量,显著降低内存需求并加速检索过程,同时有效缓解大语言模型的过时信息和幻觉问题。
English: This paper introduces 4-bit quantization for embedding vectors in retrieval-augmented generation systems, significantly reducing memory storage requirements and accelerating search processes while addressing limitations of large language models.
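
The abstract does not state which quantization scheme is used, so the following is only a minimal sketch under an assumed scheme: uniform per-vector min-max quantization to 16 levels, with two 4-bit codes packed into each byte. The function names `quantize_4bit` and `dequantize_4bit` are hypothetical and are not taken from the paper's repository.

```python
from typing import List, Tuple


def quantize_4bit(vec: List[float]) -> Tuple[bytes, float, float]:
    """Map each float to a code in [0, 15] and pack two codes per byte.

    Assumption: per-vector min-max scaling; the paper may use another scheme.
    """
    lo, hi = min(vec), max(vec)
    # A constant vector would give a zero step size, so fall back to 1.0.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((x - lo) / scale))) for x in vec]
    if len(codes) % 2:
        codes.append(0)  # pad so that codes pair up into whole bytes
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, lo, scale


def dequantize_4bit(packed: bytes, lo: float, scale: float, n: int) -> List[float]:
    """Unpack the 4-bit codes and map them back to approximate floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return [lo + c * scale for c in codes[:n]]
```

Under this scheme a vector of n values takes about n/2 bytes plus two floats for `lo` and `scale`, versus 4n bytes in 32-bit floating point, and each reconstructed value differs from the original by at most half a quantization step.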

Authors:Daniel Severo, Giuseppe Ottaviano, Matthew Muckley, Karen Ullrich, Matthijs Douze
Title: Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search
Abstract:
Approximate nearest neighbor search for vectors relies on indexes that are most often accessed from RAM. Therefore, storage is the factor limiting the size of the database that can be served from a machine. Lossy vector compression, i.e., embedding quantization, has been applied extensively to reduce the size of indexes. However, for inverted file and graph-based indices, auxiliary data such as vector ids and links (edges) can represent most of the storage cost. We introduce and evaluate lossless compression schemes for these cases. These approaches are based on asymmetric numeral systems or wavelet trees that exploit the fact that the ordering of ids is irrelevant within the data structures. In some settings, we are able to compress the vector ids by a factor of 7, with no impact on accuracy or search runtime. On billion-scale datasets, this results in a reduction of 30% of the index size. Furthermore, we show that for some datasets, these methods can also compress the quantized vector codes losslessly, by exploiting sub-optimalities in the original quantization algorithm. The source code for our approach is available at https://github.com/facebookresearch/vector_db_id_compression.
中文: 本研究提出针对向量数据库索引的无损压缩技术,通过压缩向量ID和量化编码等辅助数据,在十亿级数据集上实现高达7倍的压缩比和30%的索引规模缩减,且不影响检索性能。
English: The study introduces lossless compression techniques for vector database indexes that reduce storage by compressing auxiliary data like vector IDs and quantized codes, achieving up to a 7-fold compression with no performance loss and a 30% index size reduction on billion-scale datasets.

Authors:Aitor Belenguer, Jose A. Pascual, Javier Navaridas
Title: GLow -- A Novel, Flower-Based Simulated Gossip Learning Strategy
Abstract:
Fully decentralized learning algorithms are still in an early stage of development. Creating modular Gossip Learning strategies is not trivial due to convergence challenges and Byzantine faults intrinsic in systems of decentralized nature. Our contribution provides a novel means to simulate custom Gossip Learning systems by leveraging the state-of-the-art Flower Framework. Specifically, we introduce GLow, which will allow researchers to train and assess scalability and convergence of devices, across custom network topologies, before making a physical deployment. The Flower Framework is selected for being a simulation-featured library with a very active community on Federated Learning research. However, Flower exclusively includes vanilla Federated Learning strategies and, thus, is not originally designed to perform simulations without a centralized authority. GLow is presented to fill this gap and make simulation of Gossip Learning systems possible. Results achieved by GLow on the MNIST and CIFAR10 datasets show accuracies over 0.98 and 0.75, respectively. More importantly, GLow performs similarly in terms of accuracy and convergence to its analogous Centralized and Federated approaches in all designed experiments.
中文摘要:完全去中心化学习面临收敛和拜占庭故障的挑战,而基于Flower框架开发的GLow工具实现了Gossip学习系统的可扩展模拟,在MNIST和CIFAR10数据集上达到了与中心化和联邦学习方法相媲美的高准确率。
English Summary: Fully decentralized learning faces challenges in convergence and Byzantine faults, but GLow, a novel tool built on the Flower Framework, enables scalable simulation of Gossip Learning systems, achieving high accuracy comparable to centralized and federated approaches on MNIST and CIFAR10 datasets.

Authors:Xiaolu Hou, Mingcheng Li, Dingkang Yang, Jiawei Chen, Ziyun Qian, Xiao Zhao, Yue Jiang, Jinjie Wei, Qingyao Xu, Lihua Zhang
Title: BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation
Abstract:
With the widespread use of virtual reality applications, 3D scene generation has become a new challenging research frontier. 3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Many current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. However, the generated scenes occupy large amounts of storage space and often lack effective regularisation methods, leading to geometric distortions. To this end, we propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation, which creates diverse and high-quality 3D scenes from text or image inputs. Specifically, a crossmodal progressive scene generation framework is proposed to generate coherent scenes utilizing incremental point cloud reconstruction and 3D Gaussian splatting. Additionally, we propose a hierarchical depth prior-based regularization mechanism that utilizes multi-level constraints on depth accuracy and smoothness to enhance the realism and continuity of the generated scenes. Ultimately, we propose a structured context-guided compression mechanism that exploits structured hash grids to model the context of unorganized anchor attributes, which significantly eliminates structural redundancy and reduces storage overhead. Comprehensive experiments across multiple scenes demonstrate the significant potential and advantages of our framework compared with several baselines.
中文: 本文提出BloomScene,一种轻量级3D高斯溅射方法,通过跨模态渐进生成、分层深度正则化和结构化压缩机制,从文本或图像输入生成高质量3D场景,有效减少存储需求并提升场景真实感。
English: This paper introduces BloomScene, a lightweight 3D Gaussian splatting method that generates high-quality 3D scenes from text or image inputs using crossmodal progressive generation, hierarchical depth regularization, and structured compression to reduce storage and enhance realism.

Authors:Emre Tasar
Title: Quantum-Enhanced Conformal Methods for Multi-Output Uncertainty: A Holistic Exploration and Experimental Analysis
Abstract:
In this paper, we propose a unified approach to harness quantum conformal methods for multi-output distributions, with a particular emphasis on two experimental paradigms: (i) a standard 2-qubit circuit scenario producing a four-dimensional outcome distribution, and (ii) a multi-basis measurement setting that concatenates measurement probabilities in different bases (Z, X, Y) into a twelve-dimensional output space. By combining a multioutput regression model (e.g., random forests) with distributional conformal prediction, we validate coverage and interval-set sizes on both simulated quantum data and multi-basis measurement data. Our results confirm that classical conformal prediction can effectively provide coverage guarantees even when the target probabilities derive from inherently quantum processes. Such synergy opens the door to next-generation quantum-classical hybrid frameworks, providing both improved interpretability and rigorous coverage for quantum machine learning tasks. All codes and full reproducible Colab notebooks are made available at https://github.com/detasar/QECMMOU.
中文: 本文提出了一种统一的量子-经典混合框架,将保形预测应用于多输出量子分布,验证了其在模拟量子电路和多基测量中的有效覆盖保证,同时提升了量子机器学习任务的可解释性。
English: This paper introduces a unified quantum-classical framework that applies conformal prediction to multi-output quantum distributions, demonstrating reliable coverage guarantees for both simulated quantum circuits and multi-basis measurements while enhancing interpretability in quantum machine learning.
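
As a rough illustration of how a conformal coverage guarantee is obtained, the sketch below shows plain split conformal prediction for a single scalar output with absolute-residual scores. This is a simplification: the paper works with distributional conformal prediction over multi-output targets, and the names `conformal_quantile` and `prediction_interval` are hypothetical.

```python
import math
from typing import List, Tuple


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Return the calibration threshold for target coverage 1 - alpha.

    `scores` are nonconformity scores (here assumed to be |y - prediction|)
    computed on a held-out calibration set.
    """
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]


def prediction_interval(prediction: float, q: float) -> Tuple[float, float]:
    """Symmetric interval around a point prediction using threshold q."""
    return prediction - q, prediction + q
```

Given exchangeable calibration and test points, an interval built this way contains the true value with probability at least 1 - alpha regardless of the underlying regression model, which is why the approach carries over to targets generated by quantum processes.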

Authors:Kartik Narayan, Vibashan VS, Vishal M. Patel
Title: FaceXBench: Evaluating Multimodal LLMs on Face Understanding
Abstract:
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving abilities across a wide range of tasks and domains. However, their capacity for face understanding has not been systematically studied. To address this gap, we introduce FaceXBench, a comprehensive benchmark designed to evaluate MLLMs on complex face understanding tasks. FaceXBench includes 5,000 multimodal multiple-choice questions derived from 25 public datasets and a newly created dataset, FaceXAPI. These questions cover 14 tasks across 6 broad categories, assessing MLLMs' face understanding abilities in bias and fairness, face authentication, recognition, analysis, localization and tool retrieval. Using FaceXBench, we conduct an extensive evaluation of 26 open-source MLLMs alongside 2 proprietary models, revealing the unique challenges in complex face understanding tasks. We analyze the models across three evaluation settings: zero-shot, in-context task description, and chain-of-thought prompting. Our detailed analysis reveals that current MLLMs, including advanced models like GPT-4o, and GeminiPro 1.5, show significant room for improvement. We believe FaceXBench will be a crucial resource for developing MLLMs equipped to perform sophisticated face understanding. Code: https://github.com/Kartik-3004/facexbench
中文: FaceXBench 是一个综合性基准测试,旨在系统评估多模态大语言模型在复杂人脸理解任务中的能力,揭示了尽管这些模型具备广泛的问题解决能力,但在该领域仍有显著提升空间。
English: FaceXBench is a comprehensive benchmark introduced to systematically evaluate Multimodal Large Language Models' capabilities in complex face understanding, revealing significant room for improvement despite their broad problem-solving skills.

Authors:Weibo Gao, Qi Liu, Linan Yue, Fangzhou Yao, Rui Lv, Zheng Zhang, Hao Wang, Zhenya Huang
Title: Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems
Abstract:
Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' practice efficiency. However, the discrepancy between offline metrics and online performance significantly impedes their progress. To address this challenge, we introduce Agent4Edu, a novel personalized learning simulator leveraging recent advancements in human intelligence through large language models (LLMs). Agent4Edu features LLM-powered generative agents equipped with learner profile, memory, and action modules tailored to personalized learning algorithms. The learner profiles are initialized using real-world response data, capturing practice styles and cognitive factors. Inspired by human psychology theory, the memory module records practice facts and high-level summaries, integrating reflection mechanisms. The action module supports various behaviors, including exercise understanding, analysis, and response generation. Each agent can interact with personalized learning algorithms, such as computerized adaptive testing, enabling a multifaceted evaluation and enhancement of customized services. Through a comprehensive assessment, we explore the strengths and weaknesses of Agent4Edu, emphasizing the consistency and discrepancies in responses between agents and human learners. The code, data, and appendix are publicly available at https://github.com/bigdata-ustc/Agent4Edu.
中文摘要:Agent4Edu是一种创新的个性化学习模拟器,通过配备学习者画像、记忆和行动模块的LLM生成智能体,模拟人类学习过程来评估和优化个性化教育算法。
English Summary: Agent4Edu is a novel personalized learning simulator that uses LLM-powered generative agents with specialized modules to evaluate and enhance personalized learning algorithms by simulating human-like interactions and cognitive processes.

Authors:Xi Yang, Haoyuan Shi, Zihan Wang, Nannan Wang, Xinbo Gao
Title: CSHNet: A Novel Information Asymmetric Image Translation Method
Abstract:
Despite advancements in cross-domain image translation, challenges persist in asymmetric tasks such as SAR-to-Optical and Sketch-to-Instance conversions, which involve transforming data from a less detailed domain into one with richer content. Traditional CNN-based methods are effective at capturing fine details but struggle with global structure, leading to unwanted merging of image regions. To address this, we propose the CNN-Swin Hybrid Network (CSHNet), which combines two key modules: Swin Embedded CNN (SEC) and CNN Embedded Swin (CES), forming the SEC-CES-Bottleneck (SCB). SEC leverages CNN's detailed feature extraction while integrating the Swin Transformer's structural bias. CES, in turn, preserves the Swin Transformer's global integrity, compensating for CNN's lack of focus on structure. Additionally, CSHNet includes two components designed to enhance cross-domain information retention: the Interactive Guided Connection (IGC), which enables dynamic information exchange between SEC and CES, and Adaptive Edge Perception Loss (AEPL), which maintains structural boundaries during translation. Experimental results show that CSHNet outperforms existing methods in both visual quality and performance metrics across scene-level and instance-level datasets. Our code is available at: https://github.com/XduShi/CSHNet.
中文摘要:提出的CNN-Swin混合网络(CSHNet)通过结合局部特征提取与全局结构保持的创新模块及损失函数,有效解决了非对称图像转换中的结构失真问题,在多个数据集上实现了最优性能。
English Summary: The proposed CNN-Swin Hybrid Network (CSHNet) effectively addresses structural limitations in asymmetric image translation by integrating detailed feature extraction with global structural preservation through novel modules and loss functions, achieving superior performance across multiple datasets.

Authors:Kazuma Onishi, Katsuhiko Hayashi
Title: A Simple but Effective Closed-form Solution for Extreme Multi-label Learning
Abstract:
Extreme multi-label learning (XML) is the task of assigning multiple labels from an extremely large set of labels to each data instance. Many current high-performance XML models have many hyperparameters, which complicates the tuning process. Additionally, the models themselves are adapted specifically to XML, which complicates their reimplementation. To remedy this problem, we propose a simple method based on ridge regression for XML. The proposed method not only has a closed-form solution but also has only a single hyperparameter. Since there is no precedent for applying ridge regression to XML, this paper verifies the performance of the method on various XML benchmark datasets. Furthermore, we enhanced the prediction of low-frequency labels in XML, which hold informative content. This prediction is essential yet challenging because of the limited amount of data. Here, we employed a simple frequency-based weighting. This approach greatly simplifies the process compared with existing techniques. Experimental results revealed that it can achieve levels of performance comparable to, or even exceeding, those of models with numerous hyperparameters. Additionally, we found that the frequency-based weighting significantly improved the predictive performance for low-frequency labels, while requiring almost no changes in implementation. The source code for the proposed method is available on GitHub at https://github.com/cars1015/XML-ridge.
中文: 本文提出了一种基于岭回归的极大多标签学习方法,仅使用单一超参数,并通过频率加权提升低频标签的预测效果,其性能达到甚至超越了复杂模型。
English: This paper introduces a simple ridge regression-based method for extreme multi-label learning that uses only one hyperparameter and employs frequency-based weighting to enhance predictions for low-frequency labels, achieving performance comparable to or better than complex models.
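
The closed-form solution mentioned in the abstract is the standard ridge regression estimate, W = (XᵀX + λI)⁻¹XᵀY, where X is the feature matrix, Y the binary label matrix, and λ the single hyperparameter. A minimal dense sketch follows; the function names are hypothetical, the paper's frequency-based weighting is not shown because the abstract does not give its form, and real XML datasets would need sparse matrices rather than dense arrays.

```python
import numpy as np


def fit_ridge(X: np.ndarray, Y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge weights for multi-label targets.

    X: (n_samples, n_features) features; Y: (n_samples, n_labels) 0/1 labels;
    lam: the regularization strength (the method's only hyperparameter).
    """
    d = X.shape[1]
    # Solve (X^T X + lam * I) W = X^T Y instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


def top_k_labels(X: np.ndarray, W: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring labels for each row of X."""
    scores = X @ W
    return np.argsort(-scores, axis=1)[:, :k]
```

All label columns share the same system matrix, so one factorization of XᵀX + λI serves every label, which is part of what makes the approach simple to reimplement.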

Authors:Mengran Li, Junzhou Chen, Chenyun Yu, Guanying Jiang, Ronghui Zhang, Yanming Shen, Houbing Herbert Song
Title: Topology-Driven Attribute Recovery for Attribute Missing Graph Learning in Social Internet of Things
Abstract:
With the advancement of information technology, the Social Internet of Things (SIoT) has fostered the integration of physical devices and social networks, deepening the study of complex interaction patterns. Text Attribute Graphs (TAGs) capture both topological structures and semantic attributes, enhancing the analysis of complex interactions within the SIoT. However, existing graph learning methods are typically designed for complete attributed graphs, and the common issue of missing attributes in Attribute Missing Graphs (AMGs) increases the difficulty of analysis tasks. To address this, we propose the Topology-Driven Attribute Recovery (TDAR) framework, which leverages topological data for AMG learning. TDAR introduces an improved pre-filling method for initial attribute recovery using native graph topology. Additionally, it dynamically adjusts propagation weights and incorporates homogeneity strategies within the embedding space to suit AMGs' unique topological structures, effectively reducing noise during information propagation. Extensive experiments on public datasets demonstrate that TDAR significantly outperforms state-of-the-art methods in attribute reconstruction and downstream tasks, offering a robust solution to the challenges posed by AMGs. The code is available at https://github.com/limengran98/TDAR.
中文: 提出的拓扑驱动属性恢复(TDAR)框架通过利用拓扑结构和动态传播策略,有效解决了属性缺失图中的数据不完整问题,在属性重建和下游任务中显著优于现有方法。
English: The proposed Topology-Driven Attribute Recovery (TDAR) framework effectively addresses attribute missing issues in graphs by leveraging topological structures and dynamic propagation strategies, demonstrating superior performance in attribute reconstruction and downstream tasks compared to existing methods.

Authors:Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
Title: ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
Abstract:
Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long-parameter filling, parameter value reasoning, and 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
中文摘要:本研究提出了ComplexFuncBench基准,用于评估大语言模型在现实场景中的复杂函数调用能力,并开发了自动评估框架ComplexEval,揭示了当前模型的不足并为未来优化指明了方向。
English Summary: This study introduces ComplexFuncBench, a benchmark for evaluating complex function calling in large language models across real-world scenarios, and proposes an automatic evaluation framework called ComplexEval to identify deficiencies in current models and guide future improvements.

Authors:Yichen He, Guanhua Huang, Peiyuan Feng, Yuan Lin, Yuchen Zhang, Hang Li, Weinan E
Title: PaSa: An LLM Agent for Comprehensive Academic Paper Search
Abstract:
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa.
Chinese: PaSa是一种基于大语言模型的高级论文搜索代理,能通过工具调用和强化学习自主处理复杂学术查询,尽管使用合成数据训练,但在真实场景基准测试中显著优于现有基线模型。
English: PaSa is an advanced paper search agent powered by large language models that autonomously handles complex academic queries through tool invocation and reinforcement learning, significantly outperforming existing baselines on real-world benchmarks despite synthetic training.
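
The results above are reported as recall@20, recall@50, recall, and precision. As a minimal sketch of how such retrieval metrics are commonly computed, the following assumes a ranked list of returned paper IDs and a set of relevant paper IDs; the function names and the toy IDs are hypothetical, and PaSa's exact evaluation protocol is not reproduced here.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant papers that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = set(ranked_ids[:k]) & relevant
    return len(hits) / len(relevant)


def precision(returned_ids, relevant_ids):
    """Fraction of returned papers that are relevant."""
    returned = set(returned_ids)
    if not returned:
        return 0.0
    return len(returned & set(relevant_ids)) / len(returned)


# Toy query (hypothetical IDs): 3 relevant papers, 4 returned results.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2", "p9"}
r_at_2 = recall_at_k(ranked, relevant, 2)   # only p1 is in the top 2
r_at_4 = recall_at_k(ranked, relevant, 4)   # p1 and p2 are in the top 4
prec = precision(ranked, relevant)          # 2 of the 4 results are relevant
```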

Authors:Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan
Title: Universal Actions for Enhanced Embodied Foundation Models
Abstract:
Training on diverse, internet-scale data is a key factor in the success of recent large foundation models. Yet, using the same recipe for building embodied agents has faced noticeable difficulties. Despite the availability of many crowd-sourced embodied datasets, their action spaces often exhibit significant heterogeneity due to distinct physical embodiment and control interfaces for different robots, causing substantial challenges in developing embodied foundation models using cross-domain data. In this paper, we introduce UniAct, a new embodied foundation modeling framework operating in a Universal Action Space. Our learned universal actions capture the generic atomic behaviors across diverse robots by exploiting their shared structural features, and enable enhanced cross-domain data utilization and cross-embodiment generalizations by eliminating the notorious heterogeneity. The universal actions can be efficiently translated back to heterogeneous actionable commands by simply adding embodiment-specific details, from which fast adaptation to new robots becomes simple and straightforward. Our 0.5B instantiation of UniAct outperforms 14X larger SOTA embodied foundation models in extensive evaluations on various real-world and simulation robots, showcasing exceptional cross-embodiment control and adaptation capability, highlighting the crucial benefit of adopting universal actions. Project page: https://github.com/2toinf/UniAct
中文摘要:UniAct提出通用动作空间框架,通过提取不同机器人的共享结构特征学习通用原子行为,有效消除异构性,在跨领域泛化和跨具身控制方面展现出卓越性能,以更小模型规模超越现有大型具身基础模型。
English Summary: UniAct introduces a universal action space framework that overcomes heterogeneity in embodied AI by learning shared atomic behaviors across robots, enabling superior cross-domain generalization and outperforming larger models in real-world and simulation evaluations.

Authors:Michael Schwingshackl, Fabio Francisco Oberweger, Markus Murschitz
Title: Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
Abstract:
This paper proposes a novel approach to few-shot semantic segmentation for machinery with multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the foundation models CLIPSeg and Segment Anything Model (SAM) with the interest point detector SuperPoint and a graph convolutional network (GCN) to accurately segment machinery parts. By providing 1 to 25 annotated samples, our model, evaluated on a purely synthetic dataset depicting a truck-mounted loading crane, achieves effective segmentation across various levels of detail. Training times are kept under five minutes on consumer GPUs. The model demonstrates robust generalization to real data, achieving a qualitative synthetic-to-real generalization with a $J\&F$ score of 92.2 on real data using 10 synthetic support samples. When benchmarked on the DAVIS 2017 dataset, it achieves a $J\&F$ score of 71.5 in semi-supervised video segmentation with three support samples. This method's fast training times and effective generalization to real data make it a valuable tool for autonomous systems interacting with machinery and infrastructure, and illustrate the potential of combined and orchestrated foundation models for few-shot segmentation tasks.
中文: 本文提出了一种新颖的少样本机械部件语义分割方法,通过结合CLIPSeg、SAM、SuperPoint和图卷积网络,实现了快速训练和仅需少量标注样本即可有效泛化到真实数据的能力。
English: This paper introduces a novel few-shot semantic segmentation method for machinery parts by integrating CLIPSeg, SAM, SuperPoint, and GCN, achieving rapid training and robust generalization to real data with minimal annotated samples.

Authors:Ali Can Karaca, M. Enes Ozelbas, Saadettin Berber, Orkhan Karimli, Turabi Yildirim, M. Fatih Amasyali
Title: Robust Change Captioning in Remote Sensing: SECOND-CC Dataset and MModalCC Framework
Abstract:
Remote sensing change captioning (RSICC) aims to describe changes between bitemporal images in natural language. Existing methods often fail under challenges such as illumination differences, viewpoint changes, and blur effects, leading to inaccuracies, especially in no-change regions. Moreover, images acquired at different spatial resolutions and with registration errors tend to affect the captions. To address these issues, we introduce SECOND-CC, a novel RSICC dataset featuring high-resolution RGB image pairs, semantic segmentation maps, and diverse real-world scenarios. SECOND-CC contains 6,041 pairs of bitemporal RS images and 30,205 sentences describing the differences between images. Additionally, we propose MModalCC, a multimodal framework that integrates semantic and visual data using advanced attention mechanisms, including Cross-Modal Cross Attention (CMCA) and Multimodal Gated Cross Attention (MGCA). Detailed ablation studies and attention visualizations further demonstrate its effectiveness and ability to address RSICC challenges. Comprehensive experiments show that MModalCC outperforms state-of-the-art RSICC methods, including RSICCformer, Chg2Cap, and PSNet, with a +4.6% improvement in BLEU4 score and a +9.6% improvement in CIDEr score. We will make our dataset and codebase publicly available to facilitate future research at https://github.com/ChangeCapsInRS/SecondCC
Chinese: 本研究提出了高分辨率遥感数据集SECOND-CC和融合语义与视觉信息的多模态框架MModalCC,通过先进的注意力机制显著提升了变化描述准确性,在BLEU4和CIDEr指标上分别实现了4.6%和9.6%的性能提升。
English: The study introduces SECOND-CC, a high-resolution remote sensing dataset, and MModalCC, a multimodal framework that integrates semantic and visual data to improve change captioning accuracy, achieving state-of-the-art performance with significant gains in BLEU4 and CIDEr scores.

Authors:Xinzhe Li
Title: A Survey on LLM Test-Time Compute via Search: Tasks, LLM Profiling, Search Algorithms, and Relevant Frameworks
Abstract:
LLM test-time compute (or LLM inference) via search has emerged as a promising research area with rapid developments. However, current frameworks often adopt distinct perspectives on three key aspects: task definition, LLM profiling, and search procedures, making direct comparisons challenging. Moreover, the search algorithms employed often diverge from standard implementations, and their specific characteristics are not thoroughly specified. This survey aims to provide a comprehensive but integrated technical review of existing LLM-inference-via-search (LIS) frameworks. Specifically, we unify task definitions under the Markov Decision Process (MDP) and provide modular definitions of LLM profiling and search procedures. The definitions enable precise comparisons of various LLM inference frameworks while highlighting their departures from conventional search algorithms. We also discuss the applicability, performance, and efficiency of these methods. For ongoing paper updates, please refer to our GitHub repository: https://github.com/xinzhel/LLM-Search.
中文: 本综述通过将任务定义、大语言模型剖析和搜索流程统一在马尔可夫决策过程框架下,对现有大语言模型推理方法进行了整合性技术评述,实现了精确比较并揭示了它们与传统搜索算法的差异。
English: This survey offers a unified technical review of LLM inference frameworks by standardizing task definitions, profiling, and search procedures under a Markov Decision Process, enabling precise comparisons and highlighting deviations from conventional algorithms.

Authors:Zhaopeng Gu, Bingke Zhu, Guibo Zhu, Yingying Chen, Ming Tang, Jinqiao Wang
Title: FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization
Abstract:
Anomaly detection methods typically require extensive normal samples from the target class for training, limiting their applicability in scenarios that require rapid adaptation, such as cold start. Zero-shot and few-shot anomaly detection do not require labeled samples from the target class in advance, making them a promising research direction. Existing zero-shot and few-shot approaches often leverage powerful multimodal models to detect and localize anomalies by comparing image-text similarity. However, their handcrafted generic descriptions fail to capture the diverse range of anomalies that may emerge in different objects, and simple patch-level image-text matching often struggles to localize anomalous regions of varying shapes and sizes. To address these issues, this paper proposes the FiLo++ method, which consists of two key components. The first component, Fused Fine-Grained Descriptions (FusDes), utilizes large language models to generate anomaly descriptions for each object category, combines both fixed and learnable prompt templates and applies a runtime prompt filtering method, producing more accurate and task-specific textual descriptions. The second component, Deformable Localization (DefLoc), integrates the vision foundation model Grounding DINO with position-enhanced text descriptions and a Multi-scale Deformable Cross-modal Interaction (MDCI) module, enabling accurate localization of anomalies with various shapes and sizes. In addition, we design a position-enhanced patch matching approach to improve few-shot anomaly detection performance. Experiments on multiple datasets demonstrate that FiLo++ achieves significant performance improvements compared with existing methods. Code will be available at https://github.com/CASIA-IVA-Lab/FiLo.
中文: FiLo++方法通过大语言模型生成针对特定类别的精确异常描述,并结合可变形定位模块准确识别各种形状大小的异常,在多个数据集上实现了优于现有方法的零样本和少样本异常检测性能。
English: The FiLo++ method enhances zero-shot and few-shot anomaly detection by generating precise, category-specific anomaly descriptions using large language models and employing a deformable localization module for accurate identification of diverse anomalies, achieving superior performance across multiple datasets.

Authors:Shengkui Zhao, Zexu Pan, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma
Title: Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning
Abstract:
Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method utilizes a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply cLDM to transform the latent representations of both clean speech and background noise into Gaussian noise by the DCL process, and a parameterized model is trained to reverse this process, conditioned on noisy latent representations and text embeddings. By operating in a lower-dimensional space, the latent representations reduce the complexity of the generation process, while the DCL process enhances the model's ability to handle diverse and unseen noise environments. Our experiments demonstrate the strong performance of the proposed approach compared to existing diffusion-based methods, even with fewer iterative steps, and highlight the superior generalization capability of our models to out-of-domain noise datasets (https://github.com/modelscope/ClearerVoice-Studio).
中文: 本文提出了一种结合条件潜在扩散模型与双上下文学习的新方法,通过在压缩潜在空间中同时对纯净语音和噪声分布进行建模,以更低计算复杂度实现了优异的语音增强效果和泛化能力。
English: This paper introduces a conditional latent diffusion model with dual-context learning that operates in a compressed latent space to enhance speech by efficiently modeling both clean speech and noise distributions, achieving superior performance and generalization with reduced computational complexity.

Authors:Shengkui Zhao, Kun Zhou, Zexu Pan, Yukun Ma, Chong Zhang, Bin Ma
Title: HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution
Abstract:
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss in the adversarial training process. HiFi-SR is versatile, capable of upscaling any input speech signal between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
中文: HiFi-SR采用统一的端到端对抗网络,结合Transformer-卷积生成器和多尺度鉴别器,能将4-32kHz语音信号超分辨率提升至48kHz,在各类场景下均实现保真度显著优化的语音增强效果。
English: HiFi-SR introduces a unified end-to-end adversarial network with a transformer-convolutional generator and multi-scale discriminators, achieving superior speech super-resolution by upscaling signals from 4-32 kHz to 48 kHz with enhanced fidelity across diverse scenarios.

Authors:Victor Barbier, Eric Jeangirard
Title: Mapping scientific communities at scale
Abstract:
This study introduces a novel methodology for mapping scientific communities at scale, addressing challenges associated with network analysis in large bibliometric datasets. By leveraging enriched publication metadata from the French research portal scanR and applying advanced filtering techniques to prioritize the strongest interactions between entities, we construct detailed, scalable network maps. These maps are enhanced through systematic disambiguation of authors, affiliations, and topics using persistent identifiers and specialized algorithms. The proposed framework integrates Elasticsearch for efficient data aggregation, Graphology for network spatialization (ForceAtlas2) and community detection (Louvain algorithm), and VOSviewer for network visualization. A Large Language Model (Mistral Nemo) is used to label the detected communities, and OpenAlex data helps to enrich the results with citation count estimates to detect hot topics. This scalable approach enables insightful exploration of research collaborations and thematic structures, with potential applications for strategic decision-making in science policy and funding. These web tools are effective at the global (national) scale but are also available (and can be integrated via iframes) at the scale of any French research institution (from large research organizations to individual laboratories). The scanR community analysis tool is available online at https://scanr.enseignementsup-recherche.gouv.fr/networks/get-started. All tools and methodologies are open-source in the repository https://github.com/dataesr/scanr-ui.
中文: 本研究提出了一种可扩展的科学社群映射框架,利用增强的文献计量数据、高级过滤技术和人工智能标注,为科研政策制定提供可视化的研究合作与主题结构分析。
English: This study presents a scalable framework for mapping scientific communities using enriched bibliometric data, advanced filtering, and AI-powered labeling to visualize research collaborations and thematic structures for science policy applications.

Authors:Di Chang, Hongyi Xu, You Xie, Yipeng Gao, Zhengfei Kuang, Shengqu Cai, Chenxu Zhang, Guoxian Song, Chao Wang, Yichun Shi, Zeyuan Chen, Shijie Zhou, Linjie Luo, Gordon Wetzstein, Mohammad Soleymani
Title: X-Dyna: Expressive Dynamic Human Image Animation
Abstract:
We introduce X-Dyna, a novel zero-shot, diffusion-based pipeline for animating a single human image using facial expressions and body movements derived from a driving video, that generates realistic, context-aware dynamics for both the subject and the surrounding environment. Building on prior approaches centered on human pose control, X-Dyna addresses key shortcomings causing the loss of dynamic details, enhancing the lifelike qualities of human video animations. At the core of our approach is the Dynamics-Adapter, a lightweight module that effectively integrates reference appearance context into the spatial attentions of the diffusion backbone while preserving the capacity of motion modules in synthesizing fluid and intricate dynamic details. Beyond body pose control, we connect a local control module with our model to capture identity-disentangled facial expressions, facilitating accurate expression transfer for enhanced realism in animated scenes. Together, these components form a unified framework capable of learning physical human motion and natural scene dynamics from a diverse blend of human and scene videos. Comprehensive qualitative and quantitative evaluations demonstrate that X-Dyna outperforms state-of-the-art methods, creating highly lifelike and expressive animations. The code is available at https://github.com/bytedance/X-Dyna.
中文: X-Dyna 是一种基于零样本扩散的动画生成方法,通过驱动视频中的面部表情和身体动作来激活单张人物图像,利用动态适配器和局部控制模块实现人物与环境的逼真动态效果。
English: X-Dyna is a zero-shot diffusion-based pipeline that animates a single human image by transferring facial expressions and body movements from a driving video, generating realistic dynamics for both the subject and environment through its Dynamics-Adapter and local control modules.

Authors:Xigui Li, Yuanye Zhou, Feiyang Xiao, Xin Guo, Yichi Zhang, Chen Jiang, Jianchao Ge, Xiansheng Wang, Qimeng Wang, Taiwei Zhang, Chensen Lin, Yuan Cheng, Yuan Qi
Title: Aneumo: A Large-Scale Comprehensive Synthetic Dataset of Aneurysm Hemodynamics
Abstract:
Intracranial aneurysm (IA) is a common cerebrovascular disease that is usually asymptomatic but may cause severe subarachnoid hemorrhage (SAH) if ruptured. Although clinical practice is usually based on individual factors and morphological features of the aneurysm, its pathophysiology and hemodynamic mechanisms remain controversial. To address the limitations of current research, this study constructed a comprehensive hemodynamic dataset of intracranial aneurysms. The dataset is based on 466 real aneurysm models, and 10,000 synthetic models were generated by resection and deformation operations, including 466 aneurysm-free models and 9,534 deformed aneurysm models. The dataset also provides medical image-like segmentation mask files to support insightful analysis. In addition, the dataset contains hemodynamic data measured at eight steady-state flow rates (0.001 to 0.004 kg/s), including critical parameters such as flow velocity, pressure, and wall shear stress, providing a valuable resource for investigating aneurysm pathogenesis and clinical prediction. This dataset will help advance the understanding of the pathologic features and hemodynamic mechanisms of intracranial aneurysms and support in-depth research in related fields. Dataset hosted at https://github.com/Xigui-Li/Aneumo.
中文: 本研究基于466个真实和1万个合成颅内动脉瘤模型构建了全面的血流动力学数据集,提供流速、压力和壁面剪应力等关键参数,将推动对动脉瘤发病机制和血流动力学机理的研究。
English: This study developed a comprehensive hemodynamic dataset using 466 real and 10,000 synthetic intracranial aneurysm models, providing flow velocity, pressure, and wall shear stress data to advance understanding of aneurysm pathogenesis and mechanisms.

Authors:Changze Lv, Jingwen Xu, Yiyang Lu, Xiaohua Wang, Zhenghua Wang, Zhibo Xu, Di Yu, Xin Du, Xiaoqing Zheng, Xuanjing Huang
Title: Dendritic Localized Learning: Toward Biologically Plausible Algorithm
Abstract:
Backpropagation is the foundational algorithm for training neural networks and a key driver of deep learning's success. However, its biological plausibility has been challenged due to three primary limitations: weight symmetry, reliance on global error signals, and the dual-phase nature of training, as highlighted by the existing literature. Although various alternative learning approaches have been proposed to address these issues, most either fail to satisfy all three criteria simultaneously or yield suboptimal results. Inspired by the dynamics and plasticity of pyramidal neurons, we propose Dendritic Localized Learning (DLL), a novel learning algorithm designed to overcome these challenges. Extensive empirical experiments demonstrate that DLL satisfies all three criteria of biological plausibility while achieving state-of-the-art performance among algorithms that meet these requirements. Furthermore, DLL exhibits strong generalization across a range of architectures, including MLPs, CNNs, and RNNs. These results, benchmarked against existing biologically plausible learning algorithms, offer valuable empirical insights for future research. We hope this study can inspire the development of new biologically plausible algorithms for training multilayer networks and advancing progress in both neuroscience and machine learning. Our code is available at https://github.com/Lvchangze/Dendritic-Localized-Learning.
中文摘要:本文提出的树突局部学习(DLL)算法通过满足生物合理性的三个标准,在解决反向传播局限性的同时,在多种神经网络架构中实现了最优性能。
English Summary: The proposed Dendritic Localized Learning (DLL) algorithm addresses biological plausibility limitations of backpropagation by satisfying all three criteria while achieving state-of-the-art performance across various neural network architectures.

Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Title: MultiPruner: Balanced Structure Removal in Foundation Models
Abstract:
Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is viable for reducing model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning reinstates the structural balance in block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with fewer computing and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
中文:MultiPruner通过多维修剪策略,依次压缩残差块、多层感知机通道和注意力头,在无需训练的情况下提升了零样本准确率并实现了更高的模型压缩比。
English: MultiPruner advances pruning of large pre-trained models by employing a multidimensional strategy that compresses residual blocks, MLP channels, and attention heads, achieving superior zero-shot accuracy and higher compression ratios without training.

Authors:Xiaoyun Zheng, Liwei Liao, Jianbo Jiao, Feng Gao, Ronggang Wang
Title: Surface-SOS: Self-Supervised Object Segmentation via Neural Surface Representation
Abstract:
Self-supervised Object Segmentation (SOS) aims to segment objects without any annotations. Under conditions of multi-camera inputs, the structural, textural and geometrical consistency among each view can be leveraged to achieve fine-grained object segmentation. To make better use of the above information, we propose Surface representation based Self-supervised Object Segmentation (Surface-SOS), a new framework to segment objects for each view by 3D surface representation from multi-view images of a scene. To model high-quality geometry surfaces for complex scenes, we design a novel scene representation scheme, which decomposes the scene into two complementary neural representation modules respectively with a Signed Distance Function (SDF). Moreover, Surface-SOS is able to refine single-view segmentation with multi-view unlabeled images, by introducing coarse segmentation masks as additional input. To the best of our knowledge, Surface-SOS is the first self-supervised approach that leverages neural surface representation to break the dependence on large amounts of annotated data and strong constraints. These constraints typically involve observing target objects against a static background or relying on temporal supervision in videos. Extensive experiments on standard benchmarks including LLFF, CO3D, BlendedMVS, TUM and several real-world scenes show that Surface-SOS always yields finer object masks than its NeRF-based counterparts and surpasses supervised single-view baselines remarkably. Code is available at: https://github.com/zhengxyun/Surface-SOS.
中文: Surface-SOS是一种创新的自监督物体分割框架,通过多视角图像的3D表面表示实现无需标注的精细分割,在多个基准测试中显著优于现有方法。
English: Surface-SOS is a novel self-supervised object segmentation framework that utilizes 3D surface representations from multi-view images to achieve fine-grained segmentation without annotations, outperforming existing methods on multiple benchmarks.

Authors:Fausto German, Brian Keith, Mauricio Matus, Diego Urrutia, Claudio Meneses
Title: Semi-Supervised Image-Based Narrative Extraction: A Case Study with Historical Photographic Records
Abstract:
This paper presents a semi-supervised approach to extracting narratives from historical photographic records using an adaptation of the narrative maps algorithm. We extend the original unsupervised text-based method to work with image data, leveraging deep learning techniques for visual feature extraction and similarity computation. Our method is applied to the ROGER dataset, a collection of photographs from the 1928 Sacambaya Expedition in Bolivia captured by Robert Gerstmann. We compare our algorithmically extracted visual narratives with expert-curated timelines of varying lengths (5 to 30 images) to evaluate the effectiveness of our approach. In particular, we use the Dynamic Time Warping (DTW) algorithm to match the extracted narratives with the expert-curated baseline. In addition, we asked an expert on the topic to qualitatively evaluate a representative example of the resulting narratives. Our findings show that the narrative maps approach generally outperforms random sampling for longer timelines (10+ images, p < 0.05), with expert evaluation confirming the historical accuracy and coherence of the extracted narratives. This research contributes to the field of computational analysis of visual cultural heritage, offering new tools for historians, archivists, and digital humanities scholars to explore and understand large-scale image collections. The method's ability to generate meaningful narratives from visual data opens up new possibilities for the study and interpretation of historical events through photographic evidence.
中文: 本文提出一种半监督方法,通过改进叙事地图算法并利用深度学习技术从历史照片中提取视觉叙事,在ROGER数据集上的实验表明该方法在长序列中优于随机采样,且专家评估确认了叙事的历史准确性与连贯性。
English: This paper introduces a semi-supervised method that adapts narrative maps to extract visual narratives from historical photos using deep learning, validated on the ROGER dataset where it outperforms random sampling for longer sequences and gains expert approval for historical coherence.
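
The abstract uses Dynamic Time Warping (DTW) to match extracted narratives against expert-curated timelines. The sketch below shows the standard DTW dynamic program on two numeric sequences. How the paper encodes a narrative as a sequence (for example, as positions of images in the collection) and which local distance it uses are not stated in the abstract, so the integer sequences and the absolute-difference distance here are assumptions.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW: minimum cumulative cost of aligning sequence a to b.

    Each element may be matched to one or more elements of the other
    sequence, but the alignment must preserve order.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = best cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(
                D[i - 1][j],      # a[i-1] also matched to an earlier b
                D[i][j - 1],      # b[j-1] also matched to an earlier a
                D[i - 1][j - 1],  # one-to-one step
            )
    return D[n][m]


# Hypothetical sequences standing in for an extracted narrative and a baseline.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])  # repeats cost nothing
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

A lower DTW distance indicates that the two sequences follow a more similar progression even when their lengths differ, which is why it suits comparing timelines of varying lengths.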

Authors:Jingchen Sun, Shaobo Han, Wataru Kohno, Changyou Chen
Title: CLAP-S: Support Set Based Adaptation for Downstream Fiber-optic Acoustic Recognition
Abstract:
Contrastive Language-Audio Pretraining (CLAP) models have demonstrated unprecedented performance in various acoustic signal recognition tasks. Fiber-optic-based acoustic recognition is one of the most important downstream tasks and plays a significant role in environmental sensing. Adapting CLAP for fiber-optic acoustic recognition has become an active research area. As a non-conventional acoustic sensor, fiber-optic acoustic recognition presents a challenging, domain-specific, low-shot deployment environment with significant domain shifts due to unique frequency response and noise characteristics. To address these challenges, we propose a support-based adaptation method, CLAP-S, which linearly interpolates a CLAP Adapter with the Support Set, leveraging both implicit knowledge through fine-tuning and explicit knowledge retrieved from memory for cross-domain generalization. Experimental results show that our method delivers competitive performance on both laboratory-recorded fiber-optic ESC-50 datasets and a real-world fiber-optic gunshot-firework dataset. Our research also provides valuable insights for other downstream acoustic recognition tasks. The code and gunshot-firework dataset are available at https://github.com/Jingchensun/clap-s.
中文:CLAP-S方法通过结合微调与支持集知识,成功将对比语言-音频预训练模型应用于光纤声学识别领域,在专业数据集上展现出卓越性能。
English: The proposed CLAP-S method effectively adapts Contrastive Language-Audio Pretraining models to fiber-optic acoustic recognition by combining fine-tuning with support set knowledge, demonstrating strong performance across specialized datasets.
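
The interpolation between an adapter and a support set described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the dot-product similarity, the per-class summation, and the mixing weight `alpha` are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def support_logits(query, support_feats, support_labels, num_classes):
    """Explicit knowledge: sum query-to-support similarities per class."""
    logits = [0.0] * num_classes
    for feat, label in zip(support_feats, support_labels):
        logits[label] += dot(query, feat)
    return logits


def interpolate(adapter_logits, memory_logits, alpha):
    """Linear interpolation: alpha weights the fine-tuned adapter branch."""
    return [alpha * a + (1.0 - alpha) * m
            for a, m in zip(adapter_logits, memory_logits)]


# Hypothetical 2-class example with 2-dimensional embeddings.
query = [1.0, 0.0]
memory = support_logits(query, [[1.0, 0.0], [0.0, 1.0]], [0, 1], num_classes=2)
combined = interpolate([0.0, 2.0], memory, alpha=0.5)
```

Setting `alpha` to 1 recovers the adapter alone and setting it to 0 recovers pure retrieval from the support set, so a single scalar controls the balance between implicit and explicit knowledge.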

Authors:Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, Ziwei Liu
Title: SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
Abstract:
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).
中文摘要:本研究通过整合40个数据集扩展训练数据并采用视觉变换器进行模型升级,建立了表达性人体姿态与形状估计的基础模型,在多个基准测试中实现最优性能且展现出卓越的迁移能力。
English Summary: This research scales up expressive human pose and shape estimation through data expansion across 40 datasets and model scaling using vision transformers, establishing foundation models that achieve state-of-the-art performance across multiple benchmarks while demonstrating strong transferability.

Authors:Zilyu Ji, Yuntian Shen, Jionghao Lin, Kenneth R. Koedinger
Title: Enhancing the De-identification of Personally Identifiable Information in Educational Data
Abstract:
Protecting Personally Identifiable Information (PII), such as names, is a critical requirement in learning technologies to safeguard student and teacher privacy and maintain trust. Accurate PII detection is an essential step toward anonymizing sensitive information while preserving the utility of educational data. Motivated by recent advancements in artificial intelligence, our study investigates the GPT-4o-mini model as a cost-effective and efficient solution for PII detection tasks. We explore both prompting and fine-tuning approaches and compare GPT-4o-mini's performance against established frameworks, including Microsoft Presidio and Azure AI Language. Our evaluation on two public datasets, CRAPII and TSCC, demonstrates that the fine-tuned GPT-4o-mini model achieves superior performance, with a recall of 0.9589 on CRAPII. Additionally, fine-tuned GPT-4o-mini significantly improves precision scores (a threefold increase) while reducing computational costs to nearly one-tenth of those associated with Azure AI Language. Furthermore, our bias analysis reveals that the fine-tuned GPT-4o-mini model consistently delivers accurate results across diverse cultural backgrounds and genders. The generalizability analysis using the TSCC dataset further highlights its robustness, achieving a recall of 0.9895 with minimal additional training data from TSCC. These results emphasize the potential of fine-tuned GPT-4o-mini as an accurate and cost-effective tool for PII detection in educational data. It offers robust privacy protection while preserving the data's utility for research and pedagogical analysis. Our code is available on GitHub: https://github.com/AnonJD/PrivacyAI
中文: 本研究表明,经过微调的GPT-4o-mini模型在个人身份信息检测任务中优于现有框架,不仅显著提升了召回率与精确度,还将计算成本降至十分之一,同时在不同文化背景和性别群体中保持稳定性能。
English: This study demonstrates that fine-tuned GPT-4o-mini outperforms existing frameworks in PII detection, achieving superior recall and precision while significantly reducing computational costs and maintaining accuracy across diverse demographics.

Authors:Yuexi Du, Jiazhen Zhang, Tal Zeevi, Nicha C. Dvornek, John A. Onofrey
Title: SRE-Conv: Symmetric Rotation Equivariant Convolution for Biomedical Image Classification
Abstract:
Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN demonstrated improved classification accuracy on rotated images on all 16 test datasets, covering both 2D and 3D images, while increasing efficiency with fewer parameters and a reduced memory footprint. The code is available at https://github.com/XYPB/SRE-Conv.
中文摘要:本研究提出了一种对称旋转等变(SRE)卷积核,能在学习旋转不变特征的同时压缩模型规模,经生物医学图像数据集验证,其在提升分类精度的同时显著提高了计算效率。
English Summary: The study introduces a Symmetric Rotation-Equivariant (SRE) Convolution kernel that enhances CNN performance by learning rotation-invariant features while reducing model size, validated through improved accuracy and efficiency on biomedical image datasets.

Authors:Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Title: OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Abstract:
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth and novelty and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.
中文摘要:OmniThink是一种慢思考机器写作框架,通过模拟人类迭代学习过程来克服检索增强生成的局限性,能在保持连贯性和深度的同时提高生成文章的知识密度。
English Summary: OmniThink is a slow-thinking machine writing framework designed to overcome the limitations of retrieval-augmented generation by mimicking human iterative learning, resulting in articles with higher knowledge density while maintaining coherence and depth.

Authors:Qingyun Li, Yushi Chen, Xinya Shu, Dong Chen, Xin He, Yi Yu, Xue Yang
Title: A Simple Aerial Detection Baseline of Multimodal Language Models
Abstract:
Multimodal language models (MLMs) based on generative pre-trained Transformers are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at https://github.com/Li-Qingyun/mllm-mmrotate.
中文:本文提出了LMMRotate这一创新基线,通过将检测输出标准化为文本并实现与传统检测器的公平比较,首次将多模态语言模型应用于航空检测任务,取得了具有竞争力的性能。
English: This paper introduces LMMRotate, a novel baseline that adapts multimodal language models for aerial detection by normalizing detection outputs into text and enabling fair comparison with conventional detectors, achieving competitive performance.

Authors:Juan C. Benito, Daniel Feijoo, Alvaro Garcia, Marcos V. Conde
Title: FLOL: Fast Baselines for Real-World Low-Light Enhancement
Abstract:
Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured at night or in dark environments has been well-studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g., scenes with noise, saturated pixels, or bad illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real-scene datasets such as LOL and LSRW. Moreover, we are able to process 1080p images in under 12 ms. Code and models are available at https://github.com/cidautai/FLOL
中文:提出的FLOL+模型是一种轻量级神经网络,结合频域和空间域进行图像处理,在真实场景数据集上取得领先性能,同时成为最快的方法之一,可在12毫秒内处理1080p图像。
English: The proposed FLOL+ model is a lightweight neural network that processes images in both frequency and spatial domains, achieving state-of-the-art results on real-world datasets while being one of the fastest methods capable of processing 1080p images in under 12ms.

Authors:Hongbo Zhao, Fei Zhu, Bolin Ni, Feng Zhu, Gaofeng Meng, Zhaoxiang Zhang
Title: Practical Continual Forgetting for Pre-trained Vision Models
Abstract:
Due to privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident. In real-world scenarios, erasure requests can originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selected information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact of the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Code has been released at https://github.com/bjzhb666/GS-LoRA.
中文: 针对预训练视觉模型需持续删除特定信息的需求,本文提出GS-LoRA及其增强版GS-LoRA++,通过任务独立的LoRA模块结合组稀疏正则化和原型监督机制,在有效擦除目标知识的同时最大程度保持模型其余性能。
English: To address the need for continuous removal of specific information from pre-trained vision models while preserving overall performance, this paper introduces GS-LoRA and its enhanced version GS-LoRA++, which utilize task-specific LoRA modules with group sparse regularization and prototype supervision to effectively erase targeted knowledge with minimal impact on retained capabilities.
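
The group sparse regularization mentioned in the abstract is commonly realized as a group-lasso penalty. The sketch below assumes that formulation and treats each LoRA group as a flat list of weights. The function names and the proximal shrink step are illustrative assumptions, not taken from the released code.

```python
from math import sqrt


def group_sparse_penalty(groups):
    """Group-lasso penalty: the sum of L2 norms, one norm per group."""
    return sum(sqrt(sum(w * w for w in group)) for group in groups)


def shrink_group(group, threshold):
    """Group soft-thresholding: zero the whole group or scale it down."""
    norm = sqrt(sum(w * w for w in group))
    if norm <= threshold:
        # The group is not worth keeping, so every weight becomes zero.
        return [0.0 for _ in group]
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]


# Hypothetical example: two LoRA groups, one already inactive.
lora_groups = [[3.0, 4.0], [0.0, 0.0]]
penalty = group_sparse_penalty(lora_groups)
```

Because the penalty uses an unsquared norm per group, optimization tends to drive entire groups to exactly zero rather than shrinking all weights evenly, which matches the "select some LoRA groups and zero out the others" behavior the abstract describes.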

Authors:Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, Dongsheng Li
Title: Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
Abstract:
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples. Our implementation is available at https://github.com/zhyang2226/OPA-DPO.
中文: OPA-DPO框架通过采用基于专家反馈的在线策略对齐方法,有效减少大型视觉语言模型的幻觉问题,仅用少量数据即在多个基准测试中显著超越现有最优算法的性能表现。
English: The OPA-DPO framework addresses hallucination in LVLMs by using on-policy alignment with expert feedback, achieving significant reductions in hallucination rates with minimal data compared to existing methods.
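
For readers unfamiliar with DPO, the standard per-pair loss can be sketched as below. It takes the summed log-probabilities of the preferred and rejected responses under the trained policy and the frozen reference policy. The function name, argument names, and the default `beta` are illustrative assumptions. This is the generic DPO objective, not the OPA-DPO data construction pipeline.

```python
from math import exp, log, log1p


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) response pair."""
    # Implicit reward margin: how much more the policy favors the preferred
    # response over the rejected one, relative to the reference policy.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return log1p(exp(-margin))
    return -margin + log1p(exp(margin))


# When the policy equals the reference, the margin is zero.
loss_at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

At initialization the loss equals log 2 for every pair, since the policy and reference agree. The loss only depends on log-probability ratios against the reference, which is why responses the reference policy would rarely produce (off-policy data) interact with the KL term the abstract discusses.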

Authors:Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, Tommaso Biancalani
Title: Inference-Time Alignment in Diffusion Models with Reward-Guided Generation: Tutorial and Review
Abstract:
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at https://github.com/masa-ue/AlignInversePro
中文:本教程深入探讨了扩散模型中基于推理时引导与对齐的方法来优化下游奖励函数,提出了新颖算法并建立了与语言模型及搜索方法的理论关联。
English: This tutorial offers a comprehensive overview of inference-time guidance and alignment techniques for optimizing reward functions in diffusion models, introducing novel algorithms and exploring connections with language models and search-based methods.
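
The idea of combining a pre-trained denoising process with a value function can be illustrated by a value-weighted resampling step: draw several candidates from the pre-trained model, then pick one with probability proportional to exp(value / alpha). The sketch below assumes that form. The names `soft_weights`, `resample`, and the temperature `alpha` are hypothetical, and the candidates and values are placeholders for what a denoiser and a learned look-ahead function would supply.

```python
import random
from math import exp


def soft_weights(values, alpha):
    """Normalized weights proportional to exp(value / alpha)."""
    top = max(values)
    # Subtract the maximum before exponentiating for numerical stability.
    unnormalized = [exp((v - top) / alpha) for v in values]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def resample(candidates, values, alpha, rng):
    """Pick one candidate with probability given by its soft weight."""
    weights = soft_weights(values, alpha)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical example: two candidate next states with predicted rewards.
rng = random.Random(0)
chosen = resample(["state_a", "state_b"], [0.0, 10.0], alpha=0.1, rng=rng)
```

A small `alpha` makes the selection nearly greedy with respect to the value function, while a large `alpha` keeps samples close to the pre-trained model's own distribution.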

Authors:Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
Title: Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis
Abstract:
Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at https://github.com/CAMMA-public/Surg-FTDA
Chinese: Surg-FTDA提出了一种文本驱动的适应方法,通过弥合模态差距,仅需少量标注数据即可执行多种手术流程分析任务,在生成式和判别式任务评估中均优于基线模型。
English: Surg-FTDA introduces a text-driven adaptation method that bridges the modality gap and performs multiple surgical workflow analysis tasks with minimal annotated data, outperforming baselines in generative and discriminative evaluations.

Authors:Hanrong Zhang, Yifei Yao, Zixuan Wang, Jiayuan Su, Mengxuan Li, Peng Peng, Hongwei Wang
Title: Class Incremental Fault Diagnosis under Limited Fault Data via Supervised Contrastive Knowledge Distillation
Abstract:
Class-incremental fault diagnosis requires a model to adapt to new fault classes while retaining previous knowledge. However, limited research exists for imbalanced and long-tailed data. Extracting discriminative features from few-shot fault data is challenging, and adding new fault classes often demands costly model retraining. Moreover, incremental training of existing methods risks catastrophic forgetting, and severe class imbalance can bias the model's decisions toward normal classes. To tackle these issues, we introduce a Supervised Contrastive knowledge distiLlation for class Incremental Fault Diagnosis (SCLIFD) framework proposing supervised contrastive knowledge distillation for improved representation learning capability and less forgetting, a novel prioritized exemplar selection method for sample replay to alleviate catastrophic forgetting, and the Random Forest Classifier to address the class imbalance. Extensive experimentation on simulated and real-world industrial datasets across various imbalance ratios demonstrates the superiority of SCLIFD over existing approaches. Our code can be found at https://github.com/Zhang-Henry/SCLIFD_TII.
中文:SCLIFD框架通过结合监督对比知识蒸馏以增强特征学习并减少遗忘、优先样本回放方法以及随机森林分类器来缓解类别不平衡,有效解决了类别增量故障诊断中的关键难题,并在不平衡数据集上展现出卓越性能。
English: The SCLIFD framework addresses class-incremental fault diagnosis challenges by integrating supervised contrastive knowledge distillation to enhance feature learning and reduce forgetting, a prioritized exemplar selection method for sample replay, and a Random Forest Classifier to mitigate class imbalance, demonstrating superior performance on imbalanced datasets.

Authors:Zhaocheng Liu, Quan Tu, Wen Ye, Yu Xiao, Zhishou Zhang, Hengfu Cui, Yalun Zhu, Qiang Ju, Shizheng Li, Jian Xie
Title: Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
Abstract:
Recently, large language models have shown great potential to transform online medical consultation. Despite this, most research targets improving diagnostic accuracy with ample information, often overlooking the inquiry phase. Some studies try to evaluate or refine doctor models by using prompt-engineered patient agents. However, prompt engineering alone falls short in accurately simulating real patients. We need to explore new paradigms for patient simulation. Furthermore, the relationship between inquiry and diagnosis remains unexplored. This paper extracts dialogue strategies from real doctor-patient conversations to guide the training of a patient simulator. Our simulator shows higher anthropomorphism and lower hallucination rates, using dynamic dialogue strategies. This innovation offers a more accurate evaluation of diagnostic models and generates realistic synthetic data. We conduct extensive experiments on the relationship between inquiry and diagnosis, showing they adhere to Liebig's law: poor inquiry limits diagnosis effectiveness, regardless of diagnostic skill, and vice versa. The experiments also reveal substantial differences in inquiry performance among models. To delve into this phenomenon, the inquiry process is categorized into four distinct types. Analyzing the distribution of inquiries across these types helps explain the performance differences. The weights of our patient simulator are available at https://github.com/PatientSimulator/PatientSimulator.
中文摘要:本文通过提取真实医患对话的策略训练患者模拟器,提升了拟人化程度并降低了幻觉率,从而更准确地评估诊断模型和生成合成数据,同时基于李比希定律揭示了问诊与诊断的相互制约关系。
English Summary: This paper introduces a patient simulator trained with dialogue strategies from real doctor-patient conversations, enhancing anthropomorphism and reducing hallucinations to better evaluate diagnostic models and generate synthetic data, while also exploring the critical relationship between inquiry and diagnosis under Liebig's law.

Authors:Jan Skvrna, Lukas Neumann
Title: MonoSOWA: Scalable monocular 3D Object detector Without human Annotations
Abstract:
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve comparable accuracy to using human labels from a single dataset. The source code and model are available at https://github.com/jskvrna/MonoSOWA.
中文: 本文提出了一种仅使用单目RGB相机图像且无需人工标注来训练3D物体检测器的新方法,在性能和效率上均显著优于现有技术。
English: This paper introduces a novel method for training a 3D object detector using only single RGB camera images without human annotations, achieving superior performance and efficiency over previous approaches.

Authors:Ji Shi, Xianghua Ying, Ruohao Guo, Bowei Xing, Wenzhen Yue
Title: Normal-NeRF: Ambiguity-Robust Normal Estimation for Highly Reflective Scenes
Abstract:
Neural Radiance Fields (NeRF) often struggle with reconstructing and rendering highly reflective scenes. Recent advancements have developed various reflection-aware appearance models to enhance NeRF's capability to render specular reflections. However, the robust reconstruction of highly reflective scenes is still hindered by the inherent shape ambiguity on specular surfaces. Existing methods typically rely on additional geometry priors to regularize the shape prediction, but this can lead to oversmoothed geometry in complex scenes. Observing the critical role of surface normals in parameterizing reflections, we introduce a transmittance-gradient-based normal estimation technique that remains robust even under ambiguous shape conditions. Furthermore, we propose a dual activated densities module that effectively bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, our proposed method achieves robust reconstruction and high-fidelity rendering of scenes featuring both highly specular reflections and intricate geometric structures. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on various datasets.
Chinese: 该方法通过引入基于透射率梯度的法线估计技术和双重激活密度模块,显著提升了神经辐射场(NeRF)对高反射场景的重建鲁棒性和渲染保真度,有效处理复杂几何结构下的镜面反射问题。
English: The proposed method enhances Neural Radiance Fields (NeRF) by introducing a transmittance-gradient-based normal estimation technique and a dual activated densities module, enabling robust reconstruction and high-fidelity rendering of highly reflective scenes with complex geometries.

Authors:Tobias Fiedler, Leon Hermann, Florian Müller, Sarel Cohen, Peter Chin, Tobias Friedrich, Eilon Vaadia
Title: Teaching Wav2Vec2 the Language of the Brain
Abstract:
The decoding of continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep Learning Brain Computer Interfaces (BCIs) have recently successfully mapped neuronal activity to text contents in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then compare three setups, each across 45 different BFE architectures: full fine-tuning with pre-trained Wav2Vec2 weights, training "from scratch" without pre-trained weights, and freezing a pre-trained Wav2Vec2 while training only the BFE. Across these experiments, the best run is from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54%, outperforming the best training-from-scratch run by 20.46 percentage points and the best frozen-Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
中文: 本研究证明将预训练的Wav2Vec2模型从音频语音识别迁移至脑机接口能显著提升大脑语音解码性能,通过完全微调实现了18.54%的字符错误率。
English: This study demonstrates that transferring pre-trained Wav2Vec2 models from audio speech recognition to brain-computer interfaces significantly enhances speech decoding performance, achieving a character error rate of 18.54% through full fine-tuning.
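
The Character Error Rate reported above is conventionally the character-level edit distance between the decoded text and the reference, divided by the reference length. A minimal sketch under that standard definition follows. The function names are illustrative, and the authors' evaluation code may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character out of five.
example_cer = cer("hello", "hallo")
```

Because the numerator counts insertions as well as substitutions and deletions, CER can exceed 1.0 when the decoded text is much longer than the reference.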

Authors:Tim J. M. Jaspers, Ronald L. P. D. de Jong, Yiping Li, Carolus H. J. Kusters, Franciscus H. A. Bakker, Romy C. van Jaarsveld, Gino M. Kuiper, Richard van Hillegersberg, Jelle P. Ruurda, Willem M. Brinkman, Josien P. W. Pluim, Peter H. N. de With, Marcel Breeuwer, Yasmina Al Khalil, Fons van der Sommen
Title: Scaling up self-supervised learning for improved surgical foundation models
Abstract:
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: https://github.com/TimJaspers0801/SurgeNet.
中文:SurgeNetXL作为突破性的手术基础模型,基于470万视频帧训练,在多项手术任务和流程中实现顶尖性能,显著超越现有模型的同时为手术计算机视觉发展提供了关键见解。
English: SurgeNetXL is a groundbreaking surgical foundation model trained on 4.7 million video frames that achieves state-of-the-art performance across multiple surgical tasks and procedures, significantly outperforming existing models while providing key insights for advancing surgical computer vision.

Authors:Veronika Spieker, Hannah Eichhorn, Wenqi Huang, Jonathan K. Stelter, Tabita Catalan, Rickmer F. Braren, Daniel Rueckert, Francisco Sahli Costabal, Kerstin Hammernik, Dimitrios C. Karampinos, Claudia Prieto, Julia A. Schnabel
Title: PISCO: Self-Supervised k-Space Regularization for Improved Neural Implicit k-Space Representations of Dynamic MRI
Abstract:
Neural implicit k-space representations (NIK) have shown promising results for dynamic magnetic resonance imaging (MRI) at high temporal resolutions. Yet, reducing acquisition time, and thereby available training data, results in severe performance drops due to overfitting. To address this, we introduce a novel self-supervised k-space loss function $\mathcal{L}_\mathrm{PISCO}$, applicable for regularization of NIK-based reconstructions. The proposed loss function is based on the concept of parallel imaging-inspired self-consistency (PISCO), enforcing a consistent global k-space neighborhood relationship without requiring additional data. Quantitative and qualitative evaluations on static and dynamic MR reconstructions show that integrating PISCO significantly improves NIK representations. Particularly for high acceleration factors (R$\geq$54), NIK with PISCO achieves superior spatio-temporal reconstruction quality compared to state-of-the-art methods. Furthermore, an extensive analysis of the loss assumptions and stability shows PISCO's potential as a versatile self-supervised k-space loss function for further applications and architectures. Code is available at: https://github.com/compai-lab/2025-pisco-spieker
中文: 本研究提出了一种自监督k空间损失函数PISCO,通过强制全局邻域一致性且无需额外数据,显著提升了动态MRI中神经隐式k空间表示的性能,在高加速因子下实现了卓越的时空重建质量。
English: The study introduces a self-supervised k-space loss function, PISCO, which enhances neural implicit k-space representations for dynamic MRI by enforcing global neighborhood consistency without extra data, achieving superior reconstruction quality at high accelerations.

Authors:Fen Wang, Bomiao Wang, Xueli Shu, Zhen Liu, Zekai Shao, Chao Liu, Siming Chen
Title: ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset
Abstract:
Effective chart summary can significantly reduce the time and effort decision makers spend interpreting charts, enabling precise and efficient communication of data insights. Previous studies have faced challenges in generating accurate and semantically rich summaries of time-series data charts. In this paper, we identify summary elements and common hallucination types in the generation of time-series chart summaries, which serve as our guidelines for automatic generation. We introduce ChartInsighter, which automatically generates chart summaries of time-series data, effectively reducing hallucinations in chart summary generation. Specifically, we assign multiple agents to generate the initial chart summary and collaborate iteratively, during which they invoke external data analysis modules to extract insights and compile them into a coherent summary. Additionally, we implement a self-consistency test method to validate and correct our summary. We create a high-quality benchmark of charts and summaries, with hallucination types annotated on a sentence-by-sentence basis, facilitating the evaluation of the effectiveness of reducing hallucinations. Our evaluations using our benchmark show that our method surpasses state-of-the-art models, and that our summary hallucination rate is the lowest, which effectively reduces various hallucinations and improves summary quality. The benchmark is available at https://github.com/wangfen01/ChartInsighter.
中文摘要:ChartInsighter通过多智能体协作与自一致性检验方法,有效减少时序图表摘要中的幻觉现象,在新建基准测试中表现优于现有最优模型。
English Summary: ChartInsighter is an automated system that reduces hallucinations in time-series chart summaries through multi-agent collaboration and self-consistency testing, achieving state-of-the-art performance on a newly created benchmark.

Authors:Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao
Title: Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis
Abstract:
We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a ``free lunch,'' requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at https://github.com/Imageomics/Prompt_CAM.
Chinese: 本文提出Prompt-CAM方法,通过生成类别特定的注意力图来提升预训练视觉Transformer的可解释性,能精准识别并定位视觉相似类别间的区分性特征。
English: This paper introduces Prompt-CAM, a simple method that enhances the interpretability of pre-trained Vision Transformers by generating class-specific attention maps to precisely identify and localize distinguishing traits between visually similar categories.

Authors:Yixiao Xu, Binxing Fang, Rui Wang, Yinghai Zhou, Yuan Liu, Mohan Li, Zhihong Tian
Title: Neural Honeytrace: A Robust Plug-and-Play Watermarking Framework against Model Extraction Attacks
Abstract:
Developing high-performance deep learning models is resource-intensive, leading model owners to utilize Machine Learning as a Service (MLaaS) platforms instead of publicly releasing their models. However, malicious users may exploit query interfaces to execute model extraction attacks, reconstructing the target model's functionality locally. While prior research has investigated triggerable watermarking techniques for asserting ownership, existing methods face significant challenges: (1) most approaches require additional training, resulting in high overhead and limited flexibility, and (2) they often fail to account for advanced attackers, leaving them vulnerable to adaptive attacks. In this paper, we propose Neural Honeytrace, a robust plug-and-play watermarking framework against model extraction attacks. We first formulate a watermark transmission model from an information-theoretic perspective, providing an interpretable account of the principles and limitations of existing triggerable watermarking. Guided by the model, we further introduce: (1) a similarity-based training-free watermarking method for plug-and-play and flexible watermarking, and (2) a distribution-based multi-step watermark information transmission strategy for robust watermarking. Comprehensive experiments on four datasets demonstrate that Neural Honeytrace outperforms previous methods in efficiency and resisting adaptive attacks. Neural Honeytrace reduces the average number of samples required for a worst-case t-Test-based copyright claim from 193,252 to 1,857 with zero training cost. The code is available at https://github.com/NeurHT/NeurHT.
中文: 本文提出Neural Honeytrace,一种即插即用的水印框架,通过免训练方法和多步信息传输策略提升抗模型提取攻击的效能与鲁棒性,大幅降低了版权验证所需样本量。
English: This paper introduces Neural Honeytrace, a plug-and-play watermarking framework that enhances efficiency and robustness against model extraction attacks by employing training-free methods and multi-step information transmission, significantly reducing the required samples for copyright verification.

Authors:Zichang Ge, Changyu Chen, Arunesh Sinha, Pradeep Varakantham
Title: On Learning Informative Trajectory Embeddings for Imitation, Classification and Regression
Abstract:
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at https://github.com/Erasmo1015/vte.
中文: 本文提出了一种将状态-行动轨迹嵌入到潜在空间的新方法,以捕捉底层决策技能,无需奖励标签即可实现跨领域泛化,并在模仿和分类等任务中优于传统方法。
English: This paper introduces a novel method for embedding state-action trajectories into a latent space to capture underlying decision-making skills, enabling reward-free generalization across diverse domains and outperforming traditional approaches in tasks like imitation and classification.

Authors:Kyeongha Rho, Hyeongkeun Lee, Valentio Iverson, Joon Son Chung
Title: LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport
Abstract:
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Combined with the optimal training strategy, experimental results demonstrate that each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at https://github.com/NAVER-INTEL-Co-Lab/gaudi-lavcap.
Chinese: LAVCap是一种基于大语言模型的创新性音视频字幕生成框架,通过最优传输技术有效弥合音频与视觉数据之间的模态差异,在AudioCaps数据集上无需大规模数据或后处理即实现了最先进的性能表现。
English: LAVCap is an innovative LLM-based audio-visual captioning framework that utilizes optimal transport techniques to effectively bridge the modality gap between audio and visual data, achieving state-of-the-art performance on AudioCaps without requiring large datasets or post-processing.
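The abstract describes an optimal transport assignment map between audio and visual features but does not say how it is computed. A common choice for such maps is entropy-regularized optimal transport solved with Sinkhorn iterations. The sketch below illustrates that generic technique only; it is an assumption, not LAVCap's implementation. The function name `sinkhorn`, the uniform marginals, and the parameters `eps` and `n_iters` are all hypothetical.

```python
import math


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniformly weighted token sets.

    cost[i][j] is the (assumed) dissimilarity between audio token i and
    visual token j, e.g. one minus their cosine similarity.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel of the cost matrix.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform mass on audio tokens (assumption)
    b = [1.0 / m] * m  # uniform mass on visual tokens (assumption)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The transport plan doubles as a soft assignment map.
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]])
    for row in plan:
        print(row)
```

In this toy case nearly all mass stays on the diagonal, because each audio token is matched to the visual token with zero cost.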

Authors:Haobin Qin, Calvin Yeung, Rikuhei Umemoto, Keisuke Fujii
Title: SoccerSynth-Detection: A Synthetic Dataset for Soccer Player Detection
Abstract:
In soccer video analysis, player detection is essential for identifying key events and reconstructing tactical positions. The presence of numerous players and frequent occlusions, combined with copyright restrictions, severely restricts the availability of datasets, leaving limited options such as SoccerNet-Tracking and SportsMOT. These datasets suffer from a lack of diversity, which hinders algorithms from adapting effectively to varied soccer video contexts. To address these challenges, we developed SoccerSynth-Detection, the first synthetic dataset designed for soccer player detection. It includes a broad range of random lighting and textures, as well as simulated camera motion blur. We validated its efficacy using an object detection model (YOLOv8n) against real-world datasets (SoccerNet-Tracking and SportsMOT). In transfer tests, it matched the performance of real datasets and significantly outperformed them in images with motion blur; in pre-training tests, it demonstrated its efficacy as a pre-training dataset, significantly enhancing the algorithm's overall performance. Our work demonstrates the potential of synthetic datasets to replace real datasets for algorithm training in the field of soccer video analysis.
中文:SoccerSynth-Detection作为首个合成足球运动员检测数据集,通过迁移测试中达到真实数据集水平并在动态模糊场景表现更优,同时作为预训练数据显著提升算法性能,有效解决了数据集稀缺和多样性不足的问题。
English: SoccerSynth-Detection, the first synthetic dataset for soccer player detection, effectively addresses dataset scarcity and diversity issues by matching real dataset performance in transfer tests and excelling in motion blur scenarios, while also significantly boosting algorithm performance as a pre-training tool.

Authors:Shuo Chen, Yijin Li, Guofeng Zhang
Title: OpticFusion: Multi-Modal Neural Implicit 3D Reconstruction of Microstructures by Fusing White Light Interferometry and Optical Microscopy
Abstract:
White Light Interferometry (WLI) is a precise optical tool for measuring the 3D topography of microstructures. However, conventional WLI cannot capture the natural color of a sample's surface, which is essential for many microscale research applications that require both 3D geometry and color information. Previous methods have attempted to overcome this limitation by modifying WLI hardware and analysis software, but these solutions are often costly. In this work, we address this challenge from a computer vision multi-modal reconstruction perspective for the first time. We introduce OpticFusion, a novel approach that uses an additional digital optical microscope (OM) to achieve 3D reconstruction with natural color textures using multi-view WLI and OM images. Our method employs a two-step data association process to obtain the poses of WLI and OM data. By leveraging the neural implicit representation, we fuse multi-modal data and apply color decomposition technology to extract the sample's natural color. Tested on our multi-modal dataset of various microscale samples, OpticFusion achieves detailed 3D reconstructions with color textures. Our method provides an effective tool for practical applications across numerous microscale research fields. The source code and our real-world dataset are available at https://github.com/zju3dv/OpticFusion.
中文: 本研究提出了OpticFusion这一创新计算机视觉方法,通过结合白光干涉仪和光学显微镜,实现了微结构的三维重建并保留其自然色彩,为微观尺度研究提供了一种经济有效的解决方案。
English: This study introduces OpticFusion, a novel computer vision method that integrates White Light Interferometry (WLI) with an optical microscope (OM) to enable precise 3D reconstructions of microstructures while capturing their natural color, offering a cost-effective solution for microscale research applications.

Authors:Edward R Criscuolo, Yao Hao, Zhendong Zhang, Trevor McKeown, Deshan Yang
Title: A Vessel Bifurcation Landmark Pair Dataset for Abdominal CT Deformable Image Registration (DIR) Validation
Abstract:
Deformable image registration (DIR) is an enabling technology in many diagnostic and therapeutic tasks. Despite this, DIR algorithms have limited clinical use, largely due to a lack of benchmark datasets for quality assurance during development. To support future algorithm development, here we introduce our first-of-its-kind abdominal CT DIR benchmark dataset, comprising large numbers of highly accurate landmark pairs on matching blood vessel bifurcations. Abdominal CT image pairs of 30 patients were acquired from several public repositories as well as the authors' institution with IRB approval. The two CTs of each pair were originally acquired for the same patient on different days. An image processing workflow was developed and applied to each image pair: 1) Abdominal organs were segmented with a deep learning model, and image intensity within organ masks was overwritten. 2) Matching image patches were manually identified between the two CTs of each image pair. 3) Vessel bifurcation landmarks were labeled on one image of each image patch pair. 4) Image patches were deformably registered, and landmarks were projected onto the second image. 5) Landmark pair locations were refined manually or with an automated process. This workflow resulted in 1895 total landmark pairs, or 63 per case on average. Estimates of the landmark pair accuracy using digital phantoms were 0.7 ± 1.2 mm. The data is published in Zenodo at https://doi.org/10.5281/zenodo.14362785. Instructions for use can be found at https://github.com/deshanyang/Abdominal-DIR-QA. This dataset is a first-of-its-kind for abdominal DIR validation. The number, accuracy, and distribution of landmark pairs will allow for robust validation of DIR algorithms with precision beyond what is currently available.
中文: 本研究首次推出腹部CT可变形图像配准基准数据集,包含1,895对高精度标志点(精度达0.7±1.2毫米),能为DIR算法提供超越现有水平的精准验证。
English: This study introduces the first abdominal CT deformable image registration benchmark dataset, featuring 1,895 highly accurate landmark pairs with an estimated accuracy of 0.7±1.2mm, enabling robust validation of DIR algorithms beyond current capabilities.
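Landmark pairs like these are typically used to score a registration by its target registration error (TRE): each landmark in one image is mapped through the estimated deformation and compared with its partner in the other image. The sketch below assumes landmarks are given as (x, y, z) coordinates in millimetres and that the DIR result is available as a point-mapping function. The names `target_registration_error` and `transform` are hypothetical and are not part of the published dataset tools.

```python
import math


def target_registration_error(fixed_pts, moving_pts, transform):
    """Mean and per-landmark Euclidean error after mapping moving -> fixed.

    fixed_pts, moving_pts: equally long lists of (x, y, z) tuples in mm.
    transform: callable mapping a moving-image point into fixed-image space
               (a stand-in for the DIR result under test).
    """
    if not fixed_pts or len(fixed_pts) != len(moving_pts):
        raise ValueError("need equally sized, non-empty landmark lists")
    errors = [math.dist(transform(m), f) for f, m in zip(fixed_pts, moving_pts)]
    return sum(errors) / len(errors), errors


if __name__ == "__main__":
    fixed = [(10.0, 20.0, 30.0), (15.0, 25.0, 35.0)]
    moving = [(7.0, 16.0, 30.0), (12.0, 21.0, 35.0)]
    # An identity "registration" leaves the full 5 mm offset as error.
    mean_tre, _ = target_registration_error(fixed, moving, lambda p: p)
    print(mean_tre)
    # Average landmark pairs per case reported in the abstract: 1895 / 30.
    print(round(1895 / 30))
```

A TRE can only be trusted down to the accuracy of the landmarks themselves, which the abstract estimates at 0.7 ± 1.2 mm.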

Authors:Eshaan Tanwar, Gayatri Oke, Tanmoy Chakraborty
Title: Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing
Abstract:
Bilingual lexical processing is shaped by the complex interplay of phonological, orthographic, and semantic features of two languages within an integrated mental lexicon. In humans, this is evident in the ease with which cognate words - words similar in both orthographic form and meaning (e.g., blind, meaning "sightless" in both English and German) - are processed, compared to the challenges posed by interlingual homographs, which share orthographic form but differ in meaning (e.g., gift, meaning "present" in English but "poison" in German). We investigate how multilingual Large Language Models (LLMs) handle such phenomena, focusing on English-Spanish, English-French, and English-German cognates, non-cognates, and interlingual homographs. Specifically, we evaluate their ability to disambiguate meanings and make semantic judgments, both when these word types are presented in isolation and within sentence contexts. Our findings reveal that while certain LLMs demonstrate strong performance in recognizing cognates and non-cognates in isolation, they exhibit significant difficulty in disambiguating interlingual homographs, often performing below random baselines. This suggests LLMs tend to rely heavily on orthographic similarities rather than semantic understanding when interpreting interlingual homographs. Further, we find LLMs exhibit difficulty in retrieving word meanings, with performance in isolative disambiguation tasks having no correlation with semantic understanding. Finally, we study how the LLM processes interlingual homographs in incongruent sentences. We find that models opt for different strategies in understanding English and non-English homographs, highlighting a lack of a unified approach to handling cross-lingual ambiguities.
中文: 多语言大语言模型在识别同源词和非同源词方面表现良好,但在区分跨语言同形异义词时存在显著困难,往往依赖拼写相似性而非语义理解,且缺乏处理跨语言歧义的一致性策略。
English: Multilingual Large Language Models (LLMs) show proficiency in recognizing cognates and non-cognates but struggle significantly with disambiguating interlingual homographs, often relying on orthographic cues over semantic understanding and lacking a consistent strategy for cross-lingual ambiguities.

Authors:Suhail Basalama, Jason Cong
Title: Stream-HLS: Towards Automatic Dataflow Acceleration
Abstract:
High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in hardware design. Further, the hardware design space, especially for multi-kernel applications, grows exponentially. Therefore, several HLS automation and abstraction frameworks have been proposed recently, but many issues remain unresolved. These issues include: 1) relying mainly on hardware directives (pragmas) to apply hardware optimizations without exploring loop scheduling opportunities; 2) targeting single-kernel applications only; 3) lacking automatic and/or global design space exploration; and 4) missing critical hardware optimizations, such as graph-level pipelining for multi-kernel applications. To address these challenges, we propose a novel methodology and framework on top of the popular multi-level intermediate representation (MLIR) infrastructure called Stream-HLS. Our framework takes C/C++ or PyTorch software code and automatically generates an optimized dataflow architecture along with host code for field-programmable gate arrays (FPGAs). To achieve this, we developed an accurate analytical performance model for global scheduling and optimization of dataflow architectures. Stream-HLS is evaluated using various standard HLS benchmarks and real-world benchmarks from transformer models, convolution neural networks, and multilayer perceptrons. Stream-HLS designs outperform the designs of prior state-of-the-art automation frameworks and manually-optimized designs of abstraction frameworks by up to $79.43\times$ and $10.62\times$ geometric means respectively. Finally, the Stream-HLS framework is modularized, extensible, and open-sourced at https://github.com/UCLA-VAST/Stream-HLS (https://doi.org/10.5281/zenodo.14585909).
中文: Stream-HLS是基于MLIR的新型框架,能自动将C/C++或PyTorch代码转换为针对FPGA优化的数据流架构,通过全局设计空间探索解决了现有HLS工具的局限,相比现有最优方法实现了显著的性能提升。
English: Stream-HLS is a novel framework built on MLIR that automatically generates optimized dataflow architectures for FPGAs from C/C++ or PyTorch code, overcoming limitations of existing HLS tools by enabling global design space exploration and achieving significant performance improvements over prior methods.

Authors:Huiyu Li, Nicholas Ayache, Hervé Delingette
Title: Generative Medical Image Anonymization Based on Latent Code Projection and Optimization
Abstract:
Medical image anonymization aims to protect patient privacy by removing identifying information, while preserving the data utility to solve downstream tasks. In this paper, we address the medical image anonymization problem with a two-stage solution: latent code projection and optimization. In the projection stage, we design a streamlined encoder to project input images into a latent space and propose a co-training scheme to enhance the projection process. In the optimization stage, we refine the latent code using two deep loss functions designed to address the trade-off between identity protection and data utility dedicated to medical images. Through a comprehensive set of qualitative and quantitative experiments, we showcase the effectiveness of our approach on the MIMIC-CXR chest X-ray dataset by generating anonymized synthetic images that can serve as training set for detecting lung pathologies. Source codes are available at https://github.com/Huiyu-Li/GMIA.
中文摘要:本文提出一种两阶段医学图像匿名化方法,通过潜在空间投影与优化在保护患者隐私的同时保留数据实用性,并在胸部X光数据上验证了其检测肺部病变的有效性。
English Summary: This paper proposes a two-stage medical image anonymization method using latent code projection and optimization to balance privacy protection with data utility, validated on chest X-ray data for lung pathology detection.

Authors:Kanta Masuki, Yuto Ashida
Title: Generative diffusion model with inverse renormalization group flows
Abstract:
Diffusion models represent a class of generative models that produce data by denoising a sample corrupted by white noise. Despite the success of diffusion models in computer vision, audio synthesis, and point cloud generation, so far they overlook inherent multiscale structures in data and have a slow generation process due to many iteration steps. In physics, the renormalization group offers a fundamental framework for linking different scales and giving an accurate coarse-grained model. Here we introduce a renormalization group-based diffusion model that leverages the multiscale nature of data distributions to realize high-quality data generation. In the spirit of renormalization group procedures, we define a flow equation that progressively erases data information from fine-scale details to coarse-grained structures. By reversing the renormalization group flows, our model is able to generate high-quality samples in a coarse-to-fine manner. We validate the versatility of the model through applications to protein structure prediction and image generation. Our model consistently outperforms conventional diffusion models across standard evaluation metrics, enhancing sample quality and/or accelerating sampling speed by an order of magnitude. The proposed method alleviates the need for data-dependent tuning of hyperparameters in generative diffusion models, showing promise for systematically increasing sample efficiency based on the concept of the renormalization group.
Chinese: 本文提出了一种基于重整化群的扩散模型,利用数据的多尺度结构提升样本质量并加速生成过程,在蛋白质结构预测和图像生成等应用中优于传统扩散模型。
English: This paper introduces a renormalization group-based diffusion model that leverages multiscale data structures to enhance sample quality and accelerate generation speed, outperforming conventional diffusion models in applications like protein structure prediction and image generation.

Authors:Zihao Xu, Yuzhi Tang, Bowen Xu, Qingquan Li
Title: NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion
Abstract:
Most publicly accessible remote sensing data suffer from low resolution, limiting their practical applications. To address this, we propose a diffusion model guided by neural operators for continuous remote sensing image super-resolution (NeurOp-Diff). Neural operators are used to learn resolution representations at arbitrary scales, encoding low-resolution (LR) images into high-dimensional features, which are then used as prior conditions to guide the diffusion model for denoising. This effectively addresses the artifacts and excessive smoothing issues present in existing super-resolution (SR) methods, enabling the generation of high-quality, continuous super-resolution images. Specifically, we adjust the super-resolution scale by a scaling factor s, allowing the model to adapt to different super-resolution magnifications. Furthermore, experiments on multiple datasets demonstrate the effectiveness of NeurOp-Diff. Our code is available at https://github.com/zerono000/NeurOp-Diff.
中文: NeurOp-Diff模型通过神经算子学习任意尺度的分辨率表征来引导扩散过程,有效解决了现有超分辨率方法中的伪影和过度平滑问题,能够生成高质量连续的超分辨率遥感图像。
English: The proposed NeurOp-Diff model uses neural operators to learn multi-scale resolution representations and guide a diffusion process, effectively overcoming artifacts and oversmoothing in existing methods to produce high-quality, continuous super-resolution remote sensing images.

Authors:Zheng-An Zhu, Hsin-Che Chien, Chen-Kuo Chiang
Title: TCMM: Token Constraint and Multi-Scale Memory Bank of Contrastive Learning for Unsupervised Person Re-identification
Abstract:
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. In addition, previous memory-bank-based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at https://github.com/andy412510/TCMM.
中文: 本文提出ViT令牌约束与多尺度记忆库方法,通过减少补丁噪声和增强特征一致性,在无监督行人重识别中有效利用离群样本并提升模型性能,实现了最优的基准测试结果。
English: This paper introduces the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to mitigate patch noise and feature inconsistency in unsupervised person re-identification, achieving state-of-the-art results by enhancing outlier sample utilization and feature stability.

Authors:Jianzi Xiang, Cailu Wan, Zhu Cao
Title: Pseudolabel guided pixels contrast for domain adaptive semantic segmentation
Abstract:
Semantic segmentation is essential for comprehending images, but the process necessitates a substantial amount of detailed annotations at the pixel level. Acquiring such annotations can be costly in the real-world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses virtual data with labels to train a model and adapts it to real data without labels. Some recent works use contrastive learning, which is a powerful method for self-supervised learning, to help with this technique. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at https://github.com/embar111/pgpc
中文: 语义分割依赖昂贵的像素级标注,无监督域自适应结合对比学习虽能辅助,但现有方法忽略了类内特征多样性,导致预测错误;我们提出的伪标签引导像素对比(PGPC)框架通过有效利用目标图像信息解决了这一问题,在标准基准测试中显著提升性能且未增加模型复杂度。
English: Semantic segmentation requires costly pixel-level annotations, and while unsupervised domain adaptation with contrastive learning helps, existing methods overlook intra-class feature diversity, leading to prediction errors; our proposed Pseudo-label Guided Pixel Contrast (PGPC) framework addresses this by leveraging target image information effectively, achieving significant improvements on standard benchmarks without added model complexity.

Authors:Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos
Title: Do generative video models understand physical principles?
Abstract:
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
中文: AI视频生成在视觉真实感上取得显著进展,但通过Physics-IQ基准测试发现,现有模型对物理原理的理解仍非常有限,表明视觉逼真度并不等同于物理认知能力。
English: AI video generation achieves high visual realism but lacks deep physical understanding, as demonstrated by the Physics-IQ benchmark, which shows current models struggle with fundamental physical principles despite some progress.

Authors:Bowen Yi
Title: Unveiling Behavioral Differences in Bilingual Information Operations: A Network-Based Approach
Abstract:
Twitter has become a pivotal platform for conducting information operations (IOs), particularly during high-stakes political events. In this study, we analyze over a million tweets about the 2024 U.S. presidential election to explore an under-studied area: the behavioral differences of IO drivers from English- and Spanish-speaking communities. Using similarity graphs constructed from behavioral patterns, we identify IO drivers in both languages and evaluate the clustering quality of these graphs in an unsupervised setting. Our analysis demonstrates how different network dismantling strategies, such as node pruning and edge filtering, can impact clustering quality and the identification of coordinated IO drivers. We also reveal significant differences in the topics and political indicators between English and Spanish IO drivers. Additionally, we investigate bilingual users who post in both languages, systematically uncovering their distinct roles and behaviors compared to monolingual users. These findings underscore the importance of robust, culturally and linguistically adaptable IO detection methods to mitigate the risks of influence campaigns on social media. Our code and data are available on GitHub: https://github.com/bowenyi-pierre/humans-lab-hackathon-24.
中文: 本研究分析了2024年美国总统选举期间的百万条推文,揭示了英语和西班牙语信息操作驱动者在行为模式与话题偏好上的显著差异,强调了开发跨文化适应检测方法的重要性。
English: This study analyzes over a million tweets from the 2024 U.S. presidential election, revealing behavioral and topical differences between English and Spanish information operation drivers and highlighting the need for culturally adaptive detection methods.

Authors:Ruixiang Jiang, Changwen Chen
Title: Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Abstract:
The rapid technical progress of generative art (GenArt) has democratized the creation of visually appealing imagery. However, achieving genuine artistic impact - the kind that resonates with viewers on a deeper, more meaningful level - remains formidable as it requires a sophisticated aesthetic sensibility. This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. Our analysis reveals a critical challenge: MLLMs exhibit a tendency towards hallucinations during aesthetic reasoning, characterized by subjective opinions and unsubstantiated artistic interpretations. We further demonstrate that these hallucinations can be suppressed by employing an evidence-based and objective reasoning process, as substantiated by our proposed baseline, ArtCoT. MLLMs prompted by this principle produce multifaceted, in-depth aesthetic reasoning that aligns significantly better with human judgment. These findings have direct applications in areas such as AI art tutoring and as reward models for image generation. Ultimately, we hope this work paves the way for AI systems that can truly understand, appreciate, and contribute to art that aligns with human aesthetic values. Project homepage: https://github.com/songrise/MLLM4Art.
中文: 本文提出ArtCoT方法,通过基于证据的推理过程抑制多模态大语言模型在审美判断中的幻觉,使其评估更符合人类审美标准,可应用于AI艺术辅导和图像生成领域。
English: This paper introduces ArtCoT, a method that leverages Multimodal LLMs' reasoning to perform aesthetic judgments by suppressing hallucinations through evidence-based processes, resulting in evaluations that better align with human values and enabling applications in AI art tutoring and image generation.

Authors:Ishan Amin, Sanjeev Raja, Aditi Krishnapriyan
Title: Towards Fast, Specialized Machine Learning Force Fields: Distilling Foundation Models via Energy Hessians
Abstract:
The foundation model (FM) paradigm is transforming Machine Learning Force Fields (MLFFs), leveraging general-purpose representations and scalable training to perform a variety of computational chemistry tasks. Although MLFF FMs have begun to close the accuracy gap relative to first-principles methods, there is still a strong need for faster inference speed. Additionally, while research is increasingly focused on general-purpose models which transfer across chemical space, practitioners typically only study a small subset of systems at a given time. This underscores the need for fast, specialized MLFFs relevant to specific downstream applications, which preserve test-time physical soundness while maintaining train-time scalability. In this work, we introduce a method for transferring general-purpose representations from MLFF foundation models to smaller, faster MLFFs specialized to specific regions of chemical space. We formulate our approach as a knowledge distillation procedure, where the smaller "student" MLFF is trained to match the Hessians of the energy predictions of the "teacher" foundation model. Our specialized MLFFs can be up to 20 $\times$ faster than the original foundation model, while retaining, and in some cases exceeding, its performance and that of undistilled models. We also show that distilling from a teacher model with a direct force parameterization into a student model trained with conservative forces (i.e., computed as derivatives of the potential energy) successfully leverages the representations from the large-scale teacher for improved accuracy, while maintaining energy conservation during test-time molecular dynamics simulations. More broadly, our work suggests a new paradigm for MLFF development, in which foundation models are released along with smaller, specialized simulation "engines" for common chemical subsets.
中文: 基础模型正通过通用表示和可扩展训练改变机器学习力场,但需借助知识蒸馏方法开发更快速、针对特定化学空间的专用模型,以提升推理速度并保持物理准确性。
English: Foundation models are revolutionizing machine learning force fields by enabling scalable training and versatile applications, yet there is a need for faster, specialized models tailored to specific chemical systems through knowledge distillation techniques.

Authors:Qinyu Ma, Yuhao Zhou, Jianfeng Li
Title: Automated Retrosynthesis Planning of Macromolecules Using Large Language Models and Knowledge Graphs
Abstract:
Identifying reliable synthesis pathways in materials chemistry is a complex task, particularly in polymer science, due to the intricate and often non-unique nomenclature of macromolecules. To address this challenge, we propose an agent system that integrates large language models (LLMs) and knowledge graphs. By leveraging LLMs' powerful capabilities for extracting and recognizing chemical substance names, and storing the extracted data in a structured knowledge graph, our system fully automates the retrieval of relevant literature, extraction of reaction data, database querying, construction of retrosynthetic pathway trees, further expansion through the retrieval of additional literature, and recommendation of optimal reaction pathways. By considering the complex interdependencies among chemical reactants, a novel Multi-branched Reaction Pathway Search Algorithm (MBRPS) is proposed to help identify all valid multi-branched reaction pathways, which arise when a single product decomposes into multiple reaction intermediates. In contrast, previous studies were limited to cases where a product decomposes into at most one reaction intermediate. This work represents the first attempt to develop a fully automated, LLM-powered retrosynthesis planning agent tailored specifically for macromolecules. Applied to polyimide synthesis, our new approach constructs a retrosynthetic pathway tree with hundreds of pathways and recommends optimized routes, including both known and novel pathways. This demonstrates that utilizing LLMs for literature consultation to accomplish specific tasks is possible and crucial for future materials research, given the vast amount of materials-related literature.
中文: 本研究开发了一种结合大语言模型与知识图谱的自动化逆合成规划系统,通过新型多支链反应路径搜索算法,实现了高分子材料合成路径的全面挖掘与优化推荐,突破了传统方法中产物仅能分解为单一中间体的限制。
English: This study introduces an automated retrosynthesis planning agent combining large language models and knowledge graphs to extract polymer reaction data and propose optimal pathways, featuring a novel multi-branched search algorithm that significantly expands synthetic route discovery beyond previous single-intermediate limitations.

Authors:Trevor E. Pogue, Nicola Nicolici
Title: Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations
Abstract:
While the Karatsuba algorithm reduces the complexity of large integer multiplication, the extra additions required minimize its benefits for smaller integers of more commonly-used bitwidths. In this work, we propose the extension of the scalar Karatsuba multiplication algorithm to matrix multiplication, showing how this maintains the reduction in multiplication complexity of the original Karatsuba algorithm while reducing the complexity of the extra additions. Furthermore, we propose new matrix multiplication hardware architectures for efficiently exploiting this extension of the Karatsuba algorithm in custom hardware. We show that the proposed algorithm and hardware architectures can provide real area or execution time improvements for integer matrix multiplication compared to scalar Karatsuba or conventional matrix multiplication algorithms, while also supporting implementation through proven systolic array and conventional multiplier architectures at the core. We provide a complexity analysis of the algorithm and architectures and evaluate the proposed designs both in isolation and in an end-to-end deep learning accelerator system compared to baseline designs and prior state-of-the-art works implemented on the same type of compute platform, demonstrating their ability to increase the performance-per-area of matrix multiplication hardware.
Chinese: 本研究将Karatsuba算法扩展至矩阵乘法,在保持原有乘法复杂度优势的同时减少了额外加法操作,并提出了新的硬件架构,相比传统方法显著提升了整数矩阵乘法的单位面积性能。
English: This work extends the Karatsuba algorithm to matrix multiplication, maintaining reduced multiplication complexity while minimizing extra additions, and proposes new hardware architectures that improve performance-per-area for integer matrix multiplication compared to conventional methods.
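
The idea in this abstract can be illustrated with a minimal sketch. It assumes non-negative integer entries of a known bit width and uses hypothetical helper names (`matmul`, `mat_add`, `split`, `karatsuba_matmul`) that do not come from the paper; it shows only the arithmetic identity, not the authors' hardware architecture. Each operand matrix is split into high and low halves, so the product needs three half-width matrix multiplications instead of four, while the extra additions act on whole matrices and are therefore amortized over the inner dimension.

```python
def matmul(a, b):
    """Conventional matrix product of two lists-of-lists of integers."""
    inner = len(b)
    return [[sum(row[t] * b[t][j] for t in range(inner))
             for j in range(len(b[0]))] for row in a]


def mat_add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(mat, half):
    """Split every entry x into (x >> half, x & mask): high and low halves."""
    mask = (1 << half) - 1
    high = [[x >> half for x in row] for row in mat]
    low = [[x & mask for x in row] for row in mat]
    return high, low


def karatsuba_matmul(a, b, bits):
    """Multiply integer matrices with three half-width matrix products.

    Assumption: entries are non-negative and fit in `bits` bits.
    """
    half = bits // 2
    a_hi, a_lo = split(a, half)
    b_hi, b_lo = split(b, half)

    p_hi = matmul(a_hi, b_hi)                    # high x high
    p_lo = matmul(a_lo, b_lo)                    # low x low
    p_sum = matmul(mat_add(a_hi, a_lo),          # (hi + lo) x (hi + lo)
                   mat_add(b_hi, b_lo))

    rows, cols = len(p_hi), len(p_hi[0])
    # Cross term a_hi*b_lo + a_lo*b_hi is recovered as p_sum - p_hi - p_lo.
    return [[(p_hi[i][j] << (2 * half))
             + ((p_sum[i][j] - p_hi[i][j] - p_lo[i][j]) << half)
             + p_lo[i][j]
             for j in range(cols)] for i in range(rows)]
```

For two n x n operands, the pre-additions touch on the order of n^2 entries, whereas each saved matrix product costs on the order of n^3 scalar multiplications, which is why the overhead shrinks relative to the scalar Karatsuba case.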

Authors:Keisuke Kamo, Hideaki Iiduka
Title: Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum
Abstract:
Stochastic gradient descent with momentum (SGDM), in which a momentum term is added to SGD, has been well studied in both theory and practice. The theoretical studies show that the settings of the learning rate and momentum weight affect the convergence of SGDM. Meanwhile, the practical studies have shown that the batch-size setting strongly affects the performance of SGDM. In this paper, we focus on mini-batch SGDM with a constant learning rate and constant momentum weight, which is frequently used to train deep neural networks. We show theoretically that using a constant batch size does not always minimize the expectation of the full gradient norm of the empirical loss in training a deep neural network, whereas using an increasing batch size definitely minimizes it; that is, an increasing batch size improves the convergence of mini-batch SGDM. We also provide numerical results supporting our analyses, indicating specifically that mini-batch SGDM with an increasing batch size converges to stationary points faster than with a constant batch size, while also reducing computational cost. Python implementations of the optimizers used in the numerical experiments are available at https://github.com/iiduka-researches/NSHB_increasing_batchsize_acml25/.
中文: 本研究证明,在带动量的随机梯度下降法中,采用递增批次大小比固定批次大小能更快收敛到驻点并降低计算成本。
English: This study demonstrates that using an increasing batch size in mini-batch stochastic gradient descent with momentum (SGDM) improves convergence to stationary points and reduces computational cost compared to a constant batch size.
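
As a rough illustration of the setting described above, the following sketch runs mini-batch SGD with momentum, a constant learning rate, and a constant momentum weight on a one-parameter least-squares problem while the batch size grows on a fixed schedule. The schedule (`batch_size_at`, doubling every 50 steps), the normalized heavy-ball update, the toy data, and all hyperparameters are assumptions made for this example and are not taken from the paper or its repository.

```python
import random


def batch_size_at(step, initial=8, growth=2, every=50, cap=256):
    """Hypothetical schedule: multiply the batch size by `growth` every `every` steps."""
    return min(cap, initial * growth ** (step // every))


def full_grad(w, xs, ys):
    """Full gradient of the empirical loss 0.5 * mean((w*x - y)^2) with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train_sgdm(xs, ys, steps=300, lr=0.5, beta=0.9, seed=0):
    """Mini-batch SGD with momentum, constant lr and beta, increasing batch size."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, m = 0.0, 0.0
    for step in range(steps):
        b = batch_size_at(step, cap=len(xs))
        batch = rng.sample(indices, b)           # sample without replacement
        g = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / b
        m = beta * m + (1.0 - beta) * g          # assumed normalized heavy-ball form
        w -= lr * m
    return w


def make_data(n=256, slope=3.0, noise=0.1, seed=1):
    """Toy regression data y = slope * x + Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys
```

A typical use is to compare `abs(full_grad(w, xs, ys))` before and after `train_sgdm`; as the batch grows toward the full dataset, the variance of the stochastic gradient shrinks, which is the mechanism the abstract credits for the improved convergence.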

Authors:Tengpeng Li, Hanli Wang, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
Title: Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving
Abstract:
Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods are still facing the challenges of vision understanding, decision reasoning and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems. Code is available at https://github.com/ltp1995/GPVL
Chinese: 提出的GPVL模型通过结合3D视觉语言预训练来弥合感知与语言理解之间的差距,并采用跨模态语言模型生成驾驶决策和轨迹,在nuScenes数据集上实现了卓越的性能和泛化能力。
English: The proposed GPVL model enhances autonomous driving by integrating 3D-vision language pre-training to bridge perception and linguistic understanding, and a cross-modal language model for generating driving decisions and trajectories, achieving superior performance and generalization on the nuScenes dataset.

Authors:Olga Zatsarynna, Emad Bahrami, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall
Title: MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation
Abstract:
Long-term dense action anticipation is very challenging since it requires predicting actions and their durations several minutes into the future based on provided video observations. To model the uncertainty of future outcomes, stochastic models predict several potential future action sequences for the same observation. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency. Our code is available at https://github.com/olga-zats/DIFF_MANTA .
中文: MANTA网络通过实现线性复杂度的有效长程时序建模,解决了长期密集动作预测的挑战,在多个数据集上取得了最先进的结果,同时显著提升了计算效率。
English: The MANTA network addresses the challenge of long-term dense action anticipation by enabling effective long-range temporal modeling with linear complexity, achieving state-of-the-art results on multiple datasets while improving computational efficiency.

Authors:Shao-Hao Lu, Ren Wang, Ching-Chun Huang, Wei-Chen Chiu
Title: Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution
Abstract:
Recently, diffusion-based blind super-resolution (SR) methods have shown great ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research focusing on rectifying the reverse process of diffusion models (i.e., diffusion guidance), has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we present DADiff in this paper. DADiff incorporates degradation-aware models into the diffusion guidance framework, eliminating the need to know degradation kernels. Additionally, we propose two novel techniques: input perturbation and guidance scalar, to further improve our performance. Extensive experimental results show that our proposed method has superior performance over state-of-the-art methods on blind SR benchmarks.
English: Diffusion-based blind super-resolution methods produce detailed images but often lack fidelity, while existing guidance techniques require known degradation kernels, limiting their use; the proposed DADiff method overcomes these by integrating degradation-aware models and novel techniques to achieve superior performance without needing prior kernel knowledge.

Authors:Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang
Title: IDEA: Image Description Enhanced CLIP-Adapter
Abstract:
CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and its performance is comparable to or even exceeds that of state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.
中文: IDEA方法通过结合视觉特征和文本描述,无需训练即可提升CLIP在小样本图像分类中的性能,其可训练扩展T-IDEA利用生成的图像-文本对在11个数据集上取得了领先成果。
English: The IDEA method enhances CLIP for few-shot image classification by integrating visual features and textual descriptions, achieving competitive or superior performance without training, while its trainable extension T-IDEA sets new benchmarks on 11 datasets using generated image-text pairs.

Authors:Shiyu Wu, Jing Liu, Jing Li, Yequan Wang
Title: Few-Shot Learner Generalizes Across AI-Generated Image Detection
Abstract:
Current fake image detectors trained on large synthetic image datasets perform satisfactorily on a limited set of studied generative models. However, these detectors suffer a notable performance decline on unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space for effectively distinguishing unseen fake images using very few samples. Experiments show that FSD achieves state-of-the-art performance by $+11.6\%$ average accuracy on the GenImage dataset with only $10$ additional samples. More importantly, our method is better capable of capturing the intra-category commonality in unseen images without further training. Our code is available at https://github.com/teheperinko541/Few-Shot-AIGI-Detector.
Chinese: 提出的少样本检测器(FSD)通过学习专门的度量空间,仅需少量样本即可有效识别未知生成模型的伪造图像,在GenImage数据集上以11.6%的准确率提升达到领先性能。
English: The Few-Shot Detector (FSD) effectively identifies fake images from unseen generative models using minimal samples by learning a specialized metric space, achieving state-of-the-art performance with an 11.6% accuracy boost on the GenImage dataset.

Authors:Irina Bigoulaeva, Harish Tayyar Madabushi, Iryna Gurevych
Title: The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities
Abstract:
Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. However, exploring LLM capabilities is complicated by the fact that most widely-used models are also "instruction-tuned" to respond appropriately to prompts. With the goal of disentangling the factors influencing LLM performance, we investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. Through extensive experiments across various model families, scales and task types, which included instruction tuning 90 different LLMs, we demonstrate that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. By clarifying what instruction-tuning contributes, we extend prior research into in-context learning, which suggests that base models use priors from pretraining data to solve tasks. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve, with the added influence of the instruction-tuning dataset.
Chinese: 大型语言模型的研究表明,指令微调模型的性能与其基础模型的上下文学习能力密切相关,预训练和指令微调数据共同决定了它们的能力边界。
English: Large Language Models (LLMs) show that instruction-tuned models' performance is strongly linked to their base models' in-context learning, with both pretraining and instruction-tuning data shaping their capabilities within set limits.

Authors:Jaemyung Yu, Jaehyun Choi, Dong-Jae Lee, HyeongGwon Hong, Junmo Kim
Title: Self-supervised Transformation Learning for Equivariant Representations
Abstract:
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at https://github.com/jaemyung-u/stl.
中文: 提出的自监督变换学习(STL)方法用图像不变的变换表示替代变换标签,在不增加批次复杂度的前提下实现有效的等变学习,并在多个基准测试中超越现有方法。
English: The proposed Self-supervised Transformation Learning (STL) method replaces transformation labels with image-invariant transformation representations, enabling effective equivariant learning without increasing batch complexity and outperforming existing methods across multiple benchmarks.

Authors:Han Wang, Jianqiang Li, Qing Zhao, Zhonglong Chen, Changwei Song, Jing Tang, Yuning Huang, Wei Zhai, Yongsheng Tong, Guanghui Fu
Title: Deep Learning-Based Feature Fusion for Emotion Analysis and Suicide Risk Differentiation in Chinese Psychological Support Hotlines
Abstract:
Mental health is a critical global public health issue, and psychological support hotlines play a pivotal role in providing mental health assistance and identifying suicide risks at an early stage. However, the emotional expressions conveyed during these calls remain underexplored in current research. This study introduces a method that combines pitch acoustic features with deep learning-based features to analyze and understand emotions expressed during hotline interactions. Using data from China's largest psychological support hotline, our method achieved an F1-score of 79.13% for negative binary emotion classification. Additionally, the proposed approach was validated on an open dataset for multi-class emotion classification, where it demonstrated better performance compared to the state-of-the-art methods. To explore its clinical relevance, we applied the model to analyze the frequency of negative emotions and the rate of emotional change in the conversation, comparing 46 subjects with suicidal behavior to those without. While the suicidal group exhibited more frequent emotional changes than the non-suicidal group, the difference was not statistically significant. Importantly, our findings suggest that emotional fluctuation intensity and frequency could serve as novel features for psychological assessment scales and suicide risk prediction. The proposed method provides valuable insights into emotional dynamics and has the potential to advance early intervention and improve suicide prevention strategies through integration with clinical tools and assessments. The source code is publicly available at https://github.com/Sco-field/Speechemotionrecognition/tree/main.
中文: 本研究提出了一种结合音高特征与深度学习的方法来分析心理热线通话中的情绪表达,该方法在情绪分类中表现出色,并发现情绪波动特征可作为自杀风险评估和预防策略的新指标。
English: This study develops a method combining pitch and deep learning features to analyze emotions in psychological hotline calls, achieving high accuracy in emotion classification and identifying emotional fluctuation patterns as potential indicators for suicide risk assessment and prevention strategies.

Authors:Dongzhihan Wang, Yang Yang, Liang Xu
Title: BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module
Abstract:
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 259% in low-light conditions, outperforming existing methods. For widespread use and further development, the research work is fully open-source at https://github.com/Anastasiawd/BrightVO.
中文: BrightVO是一种基于Transformer架构的新型视觉里程计模型,它融合了IMU数据和姿态图优化技术,显著提升了姿态估计精度——在低光照环境下性能提升达259%,并配套开源了KiC4R数据集用于训练与评估。
English: BrightVO is a novel Transformer-based visual odometry model that integrates IMU data and pose graph optimization to significantly enhance pose estimation accuracy, particularly achieving a 259% improvement in low-light conditions, and is supported by the open-source KiC4R dataset for training and evaluation.

Authors:Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang
Title: ZeroStereo: Zero-shot Stereo Matching from Single Images
Abstract:
State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow. Code: https://github.com/Windsrain/ZeroStereo.
Chinese: 提出的ZeroStereo流程通过伪视差和扩散修复技术从单张图像生成高质量立体图像,仅需少量训练数据即可实现最先进的零样本泛化能力。
English: The proposed ZeroStereo pipeline synthesizes high-quality stereo images from single inputs using pseudo disparities and diffusion inpainting, achieving state-of-the-art zero-shot generalization with minimal training data.
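
Illustrative note: the abstract describes synthesizing a right image from a single (left) image and a pseudo disparity map, leaving occluded regions to be inpainted. The sketch below is not the ZeroStereo code; it is a minimal pure-Python forward warp under the usual rectified-stereo assumption that a left pixel at column x with disparity d appears at column x - d in the right view. The function name `warp_left_to_right`, the nearest-pixel rounding, and the "larger disparity wins" collision rule are assumptions made for the example.

```python
from typing import List, Optional, Tuple

def warp_left_to_right(
    left: List[List[int]], disparity: List[List[float]]
) -> Tuple[List[List[Optional[int]]], List[List[bool]]]:
    """Forward-warp a left image into the right view (hypothetical helper).

    A left pixel at column x with disparity d lands at column x - d.
    When two pixels collide, the larger disparity (closer surface) wins.
    Columns that receive no pixel stay None and are flagged as holes.
    """
    height, width = len(left), len(left[0])
    right: List[List[Optional[int]]] = [[None] * width for _ in range(height)]
    best = [[-1.0] * width for _ in range(height)]  # disparity of current winner
    for y in range(height):
        for x in range(width):
            d = disparity[y][x]
            xr = x - int(round(d))  # nearest-pixel target column
            if 0 <= xr < width and d > best[y][xr]:
                right[y][xr] = left[y][x]
                best[y][xr] = d
    hole_mask = [[p is None for p in row] for row in right]
    return right, hole_mask

# One-row example: the two rightmost pixels are closer (disparity 1).
left = [[10, 20, 30, 40]]
disp = [[0.0, 0.0, 1.0, 1.0]]
right, holes = warp_left_to_right(left, disp)
```

In this toy row the foreground pixel 30 overwrites the background pixel 20, and the last column is left as a hole. Holes of this kind are what the abstract says the fine-tuned diffusion inpainting model fills.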

Authors:Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, Xiu Li
Title: Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation
Abstract:
In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at https://github.com/jiaqihuang01/DETRIS.
中文: DETRIS提出了一种参数高效调优框架,通过增强多模态特征交互来适配未对齐编码器,仅需少量参数更新即可实现卓越性能。
English: DETRIS introduces a parameter-efficient tuning framework that enhances multimodal feature interaction for misaligned encoders, achieving superior performance with minimal parameter updates.

Authors:Qian Wang, Jiaying Wu, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, Bingsheng He
Title: What Limits LLM-based Human Simulation: LLMs or Our Design?
Abstract:
We argue that advancing LLM-based human simulation requires addressing both LLMs' inherent limitations and simulation framework design challenges. Recent studies have revealed significant gaps between LLM-based human simulations and real-world observations, highlighting these dual challenges. To address these gaps, we present a comprehensive analysis of LLM limitations and our design issues, proposing targeted solutions for both aspects. Furthermore, we explore future directions that address both challenges simultaneously, particularly in data collection, LLM generation, and evaluation. To support further research in this field, we provide a curated collection of LLM-based human simulation resources at https://github.com/Persdre/llm-human-simulation.
中文: 推进基于大语言模型的人类模拟需同时解决模型固有局限与框架设计难题,通过针对性方案和未来研究方向缩小其与真实观察间的差距。
English: Advancing LLM-based human simulation requires addressing both inherent model limitations and framework design challenges, with proposed solutions and future directions to bridge gaps with real-world observations.

Authors:Kewei Li, Yanwen Kong, Yiping Xu, Jianlin Su, Lan Huang, Ruochi Zhang, Fengfeng Zhou
Title: Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms
Abstract:
Since the emergence of research on improving the length extrapolation capabilities of large language models in 2021, some studies have made modifications to the scaling factor in the scaled dot-product attention mechanism as part of their proposed methods without rigorous theoretical justifications. To fill this gap, we propose two new scaled temperatures based on information entropy invariance to enhance length extrapolation. First, a training-free method InfoScale is designed for dot-product attention, and preserves focus on original tokens during length extrapolation by ensuring consistent entropy. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates the Windowed Attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/Information-Entropy-Invariance.
中文: 本研究基于信息熵不变性提出了InfoScale和CosScale两种新缩放方法,显著增强了语言模型的长度外推能力,通过将上下文窗口扩展至训练长度的64倍实现了最优性能。
English: This study introduces InfoScale and CosScale, two novel scaling methods based on information entropy invariance, which significantly enhance length extrapolation in language models and achieve state-of-the-art performance by extending context windows up to 64 times the training length.
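
Illustrative note: the abstract names attention score dilution and entropy consistency but does not state the InfoScale formula, so the sketch below does not reproduce it. It only shows the effect being discussed on synthetic data: the entropy of a softmax over random logits grows as the number of keys grows, and multiplying the logits by a temperature above 1 pulls the entropy back down. The log-length factor `log(test_len) / log(train_len)` is a stand-in chosen for the example, and the lengths and logit distribution are arbitrary assumptions.

```python
import math
import random

def softmax_entropy(logits, scale=1.0):
    """Shannon entropy (in nats) of softmax(scale * logits)."""
    scaled = [scale * v for v in logits]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    # H = log z - sum_i p_i * (v_i - m), using log p_i = (v_i - m) - log z
    return math.log(z) - sum(e * (v - m) for e, v in zip(exps, scaled)) / z

rng = random.Random(0)
train_len, test_len = 512, 8192            # assumed lengths, for illustration only
logits = [rng.gauss(0.0, 2.0) for _ in range(test_len)]

h_train = softmax_entropy(logits[:train_len])   # attention over a short context
h_long = softmax_entropy(logits)                # same query, 16x more keys: diluted
log_scale = math.log(test_len) / math.log(train_len)  # stand-in temperature, > 1
h_long_scaled = softmax_entropy(logits, scale=log_scale)

print(f"entropy at {train_len} keys: {h_train:.2f}")
print(f"entropy at {test_len} keys: {h_long:.2f}")
print(f"entropy at {test_len} keys, scaled: {h_long_scaled:.2f}")
```

The scaled entropy is lower than the unscaled long-context entropy because softmax entropy is non-increasing in the inverse temperature. How closely it matches the short-context entropy depends on the scaling rule, which is the question the paper addresses.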

Authors:Oscar Ramos-Soto, Jorge Ramos-Frutos, Ezequiel Perez-Zarate, Diego Oliva, Sandra E. Balderas-Mata
Title: MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification
Abstract:
Feature extraction techniques are crucial in medical image classification; however, classical feature extractors, in addition to traditional machine learning classifiers, often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The MIAFEx output feature quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating their superiority in accuracy and robustness across multiple complex medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at https://github.com/Oscar-RamosS/Medical-Image-Attention-based-Feature-Extractor-MIAFEx
Chinese: 提出的MIAFEx方法通过可学习的优化机制改进Transformer架构中的分类标记,在医学图像特征提取中显著优于传统和现代模型,尤其在数据有限时展现出更优的准确性与鲁棒性。
English: The proposed MIAFEx method enhances Transformer-based feature extraction for medical images by refining classification tokens with a learnable mechanism, outperforming both classical and modern models in accuracy and robustness, especially with limited data.

Authors:Matthieu Kirchmeyer, Pedro O. Pinheiro, Saeed Saremi
Title: Score-based 3D molecule generation with neural fields
Abstract:
We introduce a new representation for 3D molecules based on their continuous atomic density fields. Using this representation, we propose a new model based on walk-jump sampling for unconditional 3D molecule generation in the continuous space using neural fields. Our model, FuncMol, encodes molecular fields into latent codes using a conditional neural field, samples noisy codes from a Gaussian-smoothed distribution with Langevin MCMC (walk), denoises these samples in a single step (jump), and finally decodes them into molecular fields. FuncMol performs all-atom generation of 3D molecules without assumptions on the molecular structure and scales well with the size of molecules, unlike most approaches. Our method achieves competitive results on drug-like molecules and easily scales to macro-cyclic peptides, with at least one order of magnitude faster sampling. The code is available at https://github.com/prescient-design/funcmol.
中文: 我们提出FuncMol模型,利用连续原子密度场和行走-跳跃采样方法,实现高效的无条件三维分子生成,在保持竞争力的同时显著提升了采样速度。
English: We propose FuncMol, a novel model that uses continuous atomic density fields and walk-jump sampling for efficient unconditional 3D molecule generation, achieving competitive results and faster sampling speeds.
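
Illustrative note: the walk-jump procedure in the abstract (Langevin MCMC on a Gaussian-smoothed distribution, then a single denoising step) can be shown on a one-dimensional toy problem. The sketch below is not FuncMol: it replaces latent codes with scalars, uses a Gaussian data distribution so the smoothed score is known in closed form, and uses plain overdamped Langevin updates. The constants `MU`, `S`, `SIGMA` and the step size are assumptions made for the example; in practice a learned denoiser would stand in for `score`.

```python
import math
import random

MU, S, SIGMA = 3.0, 0.5, 1.0        # toy data mean, data std, smoothing noise level
VAR_Y = S * S + SIGMA * SIGMA        # variance of the Gaussian-smoothed density

def score(y):
    """Exact d/dy log p_sigma(y) for the toy Gaussian (assumption of this sketch)."""
    return -(y - MU) / VAR_Y

def walk_jump(n_steps=20000, burn_in=1000, step=0.05, seed=0):
    rng = random.Random(seed)
    y, clean = 0.0, []
    for t in range(n_steps):
        # Walk: one overdamped Langevin update on the smoothed (noisy) density.
        y += step * score(y) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            # Jump: single-step denoising, x_hat = y + sigma^2 * score(y).
            clean.append(y + SIGMA * SIGMA * score(y))
    return clean

samples = walk_jump()
mean = sum(samples) / len(samples)
print(f"mean of denoised samples: {mean:.3f} (toy data mean is {MU})")
```

The chain only ever moves in the smoothed space, where sampling is easier, and the jump maps each noisy sample to its posterior-mean estimate of a clean sample.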

Authors:Ryan Burgert, Yuancheng Xu, Wenqi Xian, Oliver Pilarski, Pascal Clausen, Mingming He, Li Ma, Yitong Deng, Lingxiao Li, Mohsen Mousavi, Michael Ryoo, Paul Debevec, Ning Yu
Title: Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
Abstract:
Generative modeling aims to transform random noise into structured outputs. In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. This is achieved by just a change in data: we pre-process training videos to yield structured noise. Consequently, our method is agnostic to diffusion model design, requiring no changes to model architectures or training pipelines. Specifically, we propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise derived from optical flow fields, while preserving the spatial Gaussianity. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead, and provide a one-stop solution for a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer. The harmonization between temporal coherence and spatial Gaussianity in our warped noise leads to effective motion control while maintaining per-frame pixel quality. Extensive experiments and user studies demonstrate the advantages of our method, making it a robust and scalable approach for controlling motion in video diffusion models. Video results are available on our webpage: https://eyeline-labs.github.io/Go-with-the-Flow. Source code and model checkpoints are available on GitHub: https://github.com/Eyeline-Labs/Go-with-the-Flow.
中文: 本研究提出一种噪声扭曲算法,通过利用光流生成的结构化噪声实现对视频扩散模型的实时运动控制,且无需改变模型架构或训练流程。
English: This study introduces a noise warping algorithm that enables real-time motion control in video diffusion models by using structured noise derived from optical flow, without altering model architecture or training processes.

Authors:Anastasios N. Angelopoulos, Michael I. Jordan, Ryan J. Tibshirani
Title: Gradient Equilibrium in Online Learning: Theory and Applications
Abstract:
We present a new perspective on online learning that we refer to as gradient equilibrium: a sequence of iterates achieves gradient equilibrium if the average of gradients of losses along the sequence converges to zero. In general, this condition is not implied by, nor implies, sublinear regret. It turns out that gradient equilibrium is achievable by standard online learning methods such as gradient descent and mirror descent with constant step sizes (rather than decaying step sizes, as is usually required for no regret). Further, as we show through examples, gradient equilibrium translates into an interpretable and meaningful property in online prediction problems spanning regression, classification, quantile estimation, and others. Notably, we show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions under arbitrary distribution shift, based on simple post hoc online descent updates. We also show that post hoc gradient updates can be used to calibrate predicted quantiles under distribution shift, and that the framework leads to unbiased Elo scores for pairwise preference prediction.
中文: 梯度均衡这一在线学习新视角指损失函数梯度序列的平均值收敛于零,可通过恒定步长方法实现,并在回归、分类等预测问题中提供可解释的优势,如应对分布偏移的去偏和校准。
English: The concept of gradient equilibrium in online learning is achieved when the average of gradients converges to zero, which can be attained through constant step-size methods like gradient descent and offers interpretable benefits across various prediction tasks, including debiasing under distribution shifts.
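
Illustrative note: the abstract states that constant-step gradient descent achieves gradient equilibrium and that this yields a debiasing scheme for black-box predictions. A minimal sketch of why, assuming a scalar offset theta and squared loss: the update theta_{t+1} = theta_t - step * g_t telescopes, so the average gradient over T rounds equals (theta_1 - theta_{T+1}) / (step * T), which goes to zero whenever the iterates stay bounded. For squared loss the gradient is the residual of the corrected prediction, so a vanishing average gradient means vanishing average bias. The data, the bias schedule, and the name `online_debias` are made up for the example.

```python
import random

def online_debias(preds, targets, step=0.1):
    """Constant-step online gradient descent on an additive offset theta."""
    theta = 0.0
    grads = []
    for f, y in zip(preds, targets):
        g = (f + theta) - y      # gradient of 0.5 * (y - (f + theta))**2 w.r.t. theta
        grads.append(g)
        theta -= step * g        # constant step size, no decay
    return theta, grads

rng = random.Random(0)
n = 5000
targets = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Black-box predictions whose bias shifts abruptly halfway through the stream.
bias = [2.0 if t < n // 2 else -1.0 for t in range(n)]
preds = [y + b + rng.gauss(0.0, 0.5) for y, b in zip(targets, bias)]

theta, grads = online_debias(preds, targets)
raw_bias = sum(f - y for f, y in zip(preds, targets)) / n   # uncorrected average error
avg_grad = sum(grads) / n                                    # corrected average error
print(f"average error without correction: {raw_bias:.3f}")
print(f"average error with online correction: {avg_grad:.4f}")
```

The corrected average error matches the telescoped value -theta / (step * n) up to floating-point error, which is the identity behind the constant-step result in this simple case.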

Authors:MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
Title: MiniMax-01: Scaling Foundation Models with Lightning Attention
Abstract:
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
中文: MiniMax-01系列模型融合闪电注意力与专家混合技术,在保持顶尖性能的同时突破上下文长度限制,以更低成本实现百万级 token 的高效处理。
English: The MiniMax-01 series combines lightning attention with a Mixture of Experts architecture to match the performance of top-tier models while supporting much longer contexts, processing millions of tokens at an affordable training and inference cost.

Authors:Wennuo Yang, Shiling Wu, Yuzhi Zhou, Cheng Luo, Xilin He, Weicheng Xie, Linlin Shen, Siyang Song
Title: Benchmarking Graph Representations and Graph Neural Networks for Multivariate Time Series Classification
Abstract:
Multivariate Time Series Classification (MTSC) enables the analysis of complex temporal data, and thus serves as a cornerstone in various real-world applications, ranging from healthcare to finance. Since the relationships among variables in MTS usually contain crucial cues, a large number of graph-based MTSC approaches have been proposed, as the graph topology and edges can explicitly represent relationships among variables (channels), where not only various MTS graph representation learning strategies but also different Graph Neural Networks (GNNs) have been explored. Despite such progress, there is no comprehensive study that fairly benchmarks and investigates the performance of existing widely-used graph representation learning strategies and GNN classifiers across different MTSC tasks. In this paper, we present the first benchmark which systematically investigates the effectiveness of three widely-used node feature definition strategies, four edge feature learning strategies and five GNN architectures, resulting in 60 different variants for graph-based MTSC. These variants are developed and evaluated with a standardized data pipeline and training/validation/testing strategy on 26 widely-used suspensor MTSC datasets. Our experiments highlight that node features significantly influence MTSC performance, while the visualization of edge features illustrates why adaptive edge learning outperforms other edge feature learning methods. The code of the proposed benchmark is publicly available at https://github.com/CVI-yangwn/Benchmark-GNN-for-Multivariate-Time-Series-Classification.
中文: 本文首次建立了基于图的多变量时间序列分类基准,通过系统评估26个数据集上的60种节点/边特征策略与图神经网络架构组合,发现节点特征对分类性能影响显著,且自适应边学习方法最具优势。
English: This paper introduces the first comprehensive benchmark for graph-based Multivariate Time Series Classification, systematically evaluating 60 combinations of node/edge feature strategies and GNN architectures across 26 datasets, revealing that node features critically impact performance while adaptive edge learning proves most effective.
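
Illustrative note: the count of 60 variants follows from crossing the three design axes named in the abstract (3 node feature strategies, 4 edge feature strategies, 5 GNN architectures). The sketch below enumerates such a grid. The placeholder names (`node_1`, `edge_1`, `gnn_1`, ...) are hypothetical, since the abstract does not list the individual strategies or architectures.

```python
from itertools import product

# Placeholder labels; the abstract gives only the size of each axis.
node_strategies = [f"node_{i}" for i in range(1, 4)]      # 3 node feature definitions
edge_strategies = [f"edge_{i}" for i in range(1, 5)]      # 4 edge feature strategies
gnn_architectures = [f"gnn_{i}" for i in range(1, 6)]     # 5 GNN architectures

# Full factorial grid: every combination is one benchmark variant.
variants = list(product(node_strategies, edge_strategies, gnn_architectures))
print(len(variants))
```

A full factorial grid like this lets the effect of one axis, such as the node feature definition, be compared while the other two are held fixed.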

Authors:Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, Nikos Komodakis
Title: Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Abstract:
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at https://github.com/Sta8is/FUTURIST .
中文: 本文提出FUTURIST方法,采用统一变换器架构结合新型掩码机制和无VAE分层标记化,在Cityscapes数据集上实现了短期和中期未来语义预测的最先进性能。
English: This paper presents FUTURIST, a multimodal future semantic prediction method using a unified transformer architecture with a novel masking mechanism and VAE-free tokenization, achieving state-of-the-art performance on Cityscapes for short- and mid-term forecasting.

Authors:Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Nelson Odhiambo Onyango, Lilian D. A. Wanzare, Samuel Rutunda, Lukman Jibril Aliyu, Esubalew Alemneh, Oumaima Hourrane, Hagos Tesfahun Gebremichael, Elyas Abdi Ismail, Meriem Beloucif, Ebrahim Chekol Jibril, Andiswa Bukula, Rooweither Mabuya, Salomey Osei, Abigail Oppong, Tadesse Destaw Belay, Tadesse Kebede Guge, Tesfa Tegegne Asfaw, Chiamaka Ijeoma Chukwuneke, Paul Röttger, Seid Muhie Yimam, Nedjma Ousidhoum
Title: AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
Abstract:
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate
中文摘要:全球南方地区因缺乏本地语言数据和忽视文化背景而存在仇恨言论审核不足的问题,AfriHate数据集通过提供15种非洲语言的原生标注内容,有效提升了仇恨言论的识别与分类能力。
English Summary: Hate speech moderation in the Global South faces challenges from inadequate data and cultural context, which the AfriHate dataset addresses by providing native-annotated content in 15 African languages to improve detection and classification.

Authors:Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu
Title: LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
Abstract:
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC) and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST .
中文摘要:该摘要介绍了LLaVA-ST模型,它通过创新的语言对齐位置嵌入和时空特征压缩方法,解决了多模态理解中时空定位的难题,并在多个基准测试中取得了优异表现。
English Summary: The abstract introduces LLaVA-ST, a multimodal large language model designed to overcome challenges in simultaneous spatial-temporal localization by using innovative embedding and feature compression techniques, achieving top performance across multiple benchmarks.

Authors:Rui Daniel, M. Rita Verdelho, Catarina Barata, Carlos Santiago
Title: Continual Deep Active Learning for Medical Imaging: Replay-Base Architecture for Context Adaptation
Abstract:
Deep Learning for medical imaging faces challenges in adapting and generalizing to new contexts. Additionally, it often lacks sufficient labeled data for specific tasks requiring significant annotation effort. Continual Learning (CL) tackles adaptability and generalizability by enabling lifelong learning from a data stream while mitigating forgetting of previously learned knowledge. Active Learning (AL) reduces the number of required annotations for effective training. This work explores both approaches (CAL) to develop a novel framework for robust medical image analysis. Based on the automatic recognition of shifts in image characteristics, Replay-Base Architecture for Context Adaptation (RBACA) employs a CL rehearsal method to continually learn from diverse contexts, and an AL component to select the most informative instances for annotation. A novel approach to evaluate CAL methods is established using a defined metric called IL-Score, which allows for the simultaneous assessment of transfer learning, forgetting, and final model performance. We show that RBACA works in domain and class-incremental learning scenarios by assessing its IL-Score on the segmentation and diagnosis of cardiac images. The results show that RBACA outperforms a baseline framework without CAL, and a state-of-the-art CAL method, across various memory sizes and annotation budgets. Our code is available at https://github.com/RuiDaniel/RBACA .
中文摘要:本研究提出RBACA框架,融合持续学习与主动学习,通过最小化标注需求适应新场景,在心脏图像分割与诊断中展现出卓越性能。
English Summary: This study introduces RBACA, a novel framework combining continual and active learning to enhance medical image analysis by adapting to new contexts with minimal annotation, achieving superior performance in cardiac image segmentation and diagnosis.

Authors:Yabo Zhang, Xinpeng Zhou, Yihan Zeng, Hang Xu, Hui Li, Wangmeng Zuo
Title: FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
Abstract:
Interactive image editing allows users to modify images through visual interaction operations such as drawing, clicking, and dragging. Existing methods construct such supervision signals from videos, as they capture how objects change with various physical interactions. However, these models are usually built upon text-to-image diffusion models, and so necessitate (i) massive training samples and (ii) an additional reference encoder to learn real-world dynamics and visual consistency. In this paper, we reformulate this task as an image-to-video generation problem, thereby inheriting powerful video diffusion priors to reduce training costs and ensure temporal consistency. Specifically, we introduce FramePainter as an efficient instantiation of this formulation. Initialized with Stable Video Diffusion, it only uses a lightweight sparse control encoder to inject editing signals. Considering the limitations of temporal attention in handling large motion between two frames, we further propose matching attention to enlarge the receptive field while encouraging dense correspondence between edited and source image tokens. We highlight the effectiveness and efficiency of FramePainter across various editing signals: it dominantly outperforms previous state-of-the-art methods with far less training data, achieving highly seamless and coherent editing of images, e.g., automatically adjusting the reflection of a cup. Moreover, FramePainter also exhibits exceptional generalization in scenarios not present in real-world videos, e.g., transforming a clownfish into a shark-like shape. Our code will be available at https://github.com/YBYBZhang/FramePainter.
Chinese Summary: FramePainter将交互式图像编辑重构为图像到视频生成问题,利用视频扩散先验知识,通过轻量级控制编码器实现高质量、时序一致的编辑效果,大幅减少了训练数据需求。
English Summary: FramePainter reformulates interactive image editing as an image-to-video generation task, leveraging video diffusion priors to achieve high-quality, temporally consistent edits with minimal training data and a lightweight control encoder.

Authors:Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray
Title: CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation
Abstract:
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This poses considerable security risks when using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to address this need but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework assesses code functionality and security simultaneously, using high-quality task specifications and outcome-driven test oracles that provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation of LLM-generated code, overcoming previous benchmarks' shortcomings. Through our evaluations, CWEval reveals a notable portion of functional but insecure code produced by LLMs, and shows a serious inaccuracy of previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at: https://github.com/Co1lin/CWEval.
Chinese: CWEval 是一个结果驱动的评估框架,通过高质量任务规范和结果驱动的测试机制,同时评估代码功能性和安全性,有效克服了以往基准测试的不足,显著提升了LLM生成代码的安全评估水平。
English: CWEval is an outcome-driven framework that enhances secure code generation evaluation by simultaneously assessing both functionality and security, overcoming limitations of previous benchmarks through high-quality specifications and multilingual testing.

Authors:Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
Title: OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
中文: OpenCSG中文语料库通过提供多样化的高质量数据集,有效解决了中文大语言模型训练数据匮乏的问题,显著提升了模型在C-Eval等任务中的表现。
English: The OpenCSG Chinese Corpus addresses the scarcity of high-quality Chinese datasets for LLMs by providing diverse, scalable datasets that significantly enhance model performance in tasks like C-Eval.

Authors:Yin Fang, Xinle Deng, Kangwei Liu, Ningyu Zhang, Jingyang Qian, Penghui Yang, Xiaohui Fan, Huajun Chen
Title: A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following
Abstract:
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks (such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction) using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.
中文: InstructCell作为多模态AI助手,通过自然语言实现对单细胞RNA测序数据的直观灵活分析,在细胞注释和药物预测等任务中优于现有模型,同时显著降低了复杂生物数据的技术门槛。
English: InstructCell is a multimodal AI copilot that uses natural language to enable intuitive and flexible analysis of single-cell RNA sequencing data, outperforming existing models in tasks like cell annotation and drug prediction while making complex biological data more accessible.

Authors:Qian Zeng, Jie Song, Han Zheng, Hao Jiang, Mingli Song
Title: D$^2$-DPM: Dual Denoising for Quantized Diffusion Probabilistic Models
Abstract:
Diffusion models have achieved cutting-edge performance in image generation. However, their lengthy denoising process and computationally intensive score estimation network impede their scalability in low-latency and resource-constrained scenarios. Post-training quantization (PTQ) compresses and accelerates diffusion models without retraining, but it inevitably introduces additional quantization noise, resulting in mean and variance deviations. In this work, we propose D2-DPM, a dual denoising mechanism aimed at precisely mitigating the adverse effects of quantization noise on the noise estimation network. Specifically, we first unravel the impact of quantization noise on the sampling equation into two components: the mean deviation and the variance deviation. The mean deviation alters the drift coefficient of the sampling equation, influencing the trajectory trend, while the variance deviation magnifies the diffusion coefficient, impacting the convergence of the sampling trajectory. The proposed D2-DPM is thus devised to denoise the quantization noise at each time step, and then denoise the noisy sample through the inverse diffusion iterations. Experimental results demonstrate that D2-DPM achieves superior generation quality, yielding a 1.42 lower FID than the full-precision model while achieving 3.99x compression and 11.67x bit-operation acceleration.
中文: D2-DPM提出双重去噪机制,有效抵消扩散模型量化噪声影响,在实现显著压缩和加速的同时提升了生成质量。
English: D2-DPM introduces a dual denoising mechanism to counteract quantization noise effects in diffusion models, achieving enhanced generation quality with significant compression and acceleration.

Authors:Marcel Rogge, Didier Stricker
Title: Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models
Abstract:
Current Gaussian Splatting approaches are effective for reconstructing entire scenes but lack the option to target specific objects, making them computationally expensive and unsuitable for object-specific applications. We propose a novel approach that leverages object masks to enable targeted reconstruction, resulting in object-centric models. Additionally, we introduce an occlusion-aware pruning strategy to minimize the number of Gaussians without compromising quality. Our method reconstructs compact object models, yielding object-centric Gaussian and mesh representations that are up to 96% smaller and up to 71% faster to train compared to the baseline while retaining competitive quality. These representations are immediately usable for downstream applications such as appearance editing and physics simulation without additional processing.
中文摘要:该方法利用物体遮罩和遮挡感知剪枝策略,构建了紧凑的物体中心高斯与网格表示,在保持质量的同时显著减小了模型体积并加速训练,可直接应用于外观编辑和物理模拟等下游任务。
English Summary: The proposed method uses object masks and occlusion-aware pruning to create compact, object-centric Gaussian and mesh representations that are significantly smaller and faster to train while maintaining quality, enabling direct use in applications like appearance editing and physics simulation.

Authors:Hanene F. Z. Brachemi Meftah, Wassim Hamidouche, Sid Ahmed Fezza, Olivier Déforges, Kassem Kallas
Title: Energy Backdoor Attack to Deep Neural Networks
Abstract:
The rise of deep learning (DL) has increased computing complexity and energy use, prompting the adoption of application specific integrated circuits (ASICs) for energy-efficient edge and mobile deployment. However, recent studies have demonstrated the vulnerability of these accelerators to energy attacks. Despite the development of various inference time energy attacks in prior research, backdoor energy attacks remain unexplored. In this paper, we design an innovative energy backdoor attack against deep neural networks (DNNs) operating on sparsity-based accelerators. Our attack is carried out in two distinct phases: backdoor injection and backdoor stealthiness. Experimental results using ResNet-18 and MobileNet-V2 models trained on CIFAR-10 and Tiny ImageNet datasets show the effectiveness of our proposed attack in increasing energy consumption on trigger samples while preserving the model's performance for clean/regular inputs. This demonstrates the vulnerability of DNNs to energy backdoor attacks. The source code of our attack is available at: https://github.com/hbrachemi/energy_backdoor.
中文摘要:本文针对基于稀疏性加速器的深度神经网络提出了一种新型能量后门攻击,该攻击能在保持模型正常性能的同时,显著增加触发样本的能耗。
English Summary: This paper introduces a novel energy backdoor attack targeting deep neural networks on sparsity-based accelerators, which significantly increases energy consumption for trigger inputs while maintaining normal model performance.

Authors:Xudong Wang, Qingbo Hao, Xu Cheng, Yingyuan Xiao
Title: UFGraphFR: Graph Federation Recommendation System based on User Text description features
Abstract:
Federated learning has emerged as a key paradigm in privacy-preserving computing due to its "data usable but not visible" property, enabling users to collaboratively train models without sharing raw data. Motivated by this, federated recommendation systems offer a promising architecture that balances user privacy with recommendation accuracy through distributed collaborative learning. However, existing federated recommendation methods often neglect the underlying semantic or behavioral relationships between users during parameter aggregation, which limits their recommendation effectiveness. To overcome this limitation, graph-based federated recommendation systems have been proposed to leverage neighborhood information. Yet, conventional graph construction methods usually require access to raw user data or explicit social links, which contradicts the strict privacy requirements of federated learning. In this work, we propose UFGraphFR (User Text-feature-based Graph Federated Recommendation), a novel personalized federated recommendation framework that constructs a user graph based on clients' locally embedded text features. Our core assumption is that users with similar textual feature descriptions exhibit similar preferences. Accordingly, UFGraphFR introduces two key components: (1) a privacy-preserving user relationship graph constructed from the joint embedding layer's weight matrix without leaking raw user attributes; (2) a Transformer-based architecture to model temporal dependencies in user-item interaction sequences. Experimental results on benchmark datasets such as MovieLens and HetRec2011 demonstrate that UFGraphFR achieves recommendation accuracy comparable to both centralized and state-of-the-art federated baselines while preserving user privacy. The code is available at: https://github.com/trueWangSyutung/UFGraphFR.
中文:UFGraphFR提出了一种基于嵌入文本特征构建用户图谱的隐私保护联邦推荐框架,通过Transformer架构建模时序依赖,在保护用户隐私的同时实现了与先进方法相当的推荐精度。
English: UFGraphFR introduces a privacy-preserving federated recommendation framework that constructs user graphs from embedded text features and employs Transformer architecture to enhance recommendation accuracy without compromising user data security.
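
The core assumption above (users with similar text-feature embeddings have similar preferences) can be illustrated with a nearest-neighbour graph. The following is a minimal sketch, not the authors' implementation: it assumes each client exposes one vector derived from its embedding weights, and it uses cosine similarity with a `top_k` cutoff, both of which are illustrative choices. The names `cosine` and `build_user_graph` and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_user_graph(client_vectors, top_k=1):
    """Link every user to the top_k most similar other users.

    client_vectors maps a user id to a vector derived from that client's
    embedding weights (hypothetical format); no raw attributes are used.
    """
    graph = {}
    for user, vec in client_vectors.items():
        scored = sorted(
            ((cosine(vec, other), name)
             for name, other in client_vectors.items() if name != user),
            reverse=True,
        )
        graph[user] = [name for _, name in scored[:top_k]]
    return graph


# Hypothetical two-dimensional embeddings for three clients.
vectors = {"alice": [1.0, 0.1], "bob": [0.9, 0.2], "carol": [0.0, 1.0]}
user_graph = build_user_graph(vectors, top_k=1)
print(user_graph)
```

A server could then aggregate each client's parameters mainly with those of its graph neighbours; how UFGraphFR actually weights neighbours is not stated in the abstract.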

Authors:Xiao Xu, Qiong Wu, Pingyi Fan, Kezhi Wang
Title: Enhanced SPS Velocity-adaptive Scheme: Access Fairness in 5G NR V2I Networks
Abstract:
Vehicle-to-Infrastructure (V2I) technology enables information exchange between vehicles and road infrastructure. Specifically, when a vehicle approaches a roadside unit (RSU), it can exchange information with the RSU to obtain accurate data that assists in driving. With the release of the 3rd Generation Partnership Project (3GPP) Release 16, which includes the 5G New Radio (NR) Vehicle-to-Everything (V2X) standards, vehicles typically adopt mode-2 communication using sensing-based semi-persistent scheduling (SPS) for resource allocation. In this approach, vehicles identify candidate resources within a selection window and exclude ineligible resources based on information from a sensing window. However, vehicles often drive at different speeds, resulting in varying amounts of data transmission with RSUs as they pass by, which leads to unfair access. Therefore, it is essential to design an access scheme that accounts for different vehicle speeds to achieve fair access across the network. This paper formulates an optimization problem for vehicular networks and proposes a multi-objective optimization scheme to address it by adjusting the selection window in the SPS mechanism of 5G NR V2I mode-2. Simulation results demonstrate the effectiveness of the proposed scheme.
中文: 车对基础设施通信使车辆与路边单元能够交换数据,但不同车速导致接入不公,因此提出一种多目标优化方案,通过调整5G NR V2I模式2中的选择窗口来确保公平性。
English: Vehicle-to-Infrastructure communication enables data exchange between vehicles and roadside units, but varying speeds cause unfair access, prompting a proposed multi-objective optimization scheme that adjusts the selection window in 5G NR V2I mode-2 to ensure fairness.

Authors:Attila Répai, Sándor Földi, Péter Sótonyi, György Cserey
Title: An Open Source Validation System for Continuous Arterial Blood Pressure Measuring Sensors
Abstract:
Measuring the blood pressure waveform is becoming a more frequently studied area. The development of sensor technologies opens many new ways to measure high-quality signals. Developing such a purpose-specific sensor can be time-consuming and expensive, and the sensor can be difficult to test or validate with known and consistent waveforms. In this paper, we present an open source blood pressure waveform simulator with an open source Python validation package to reduce development costs for early-stage sensor development and research. The simulator mainly consists of 3D printed parts, a technology that has become widely available and cheap. The core part of the simulator is a 3D printed cam that can be generated based on real blood pressure waveforms. The validation framework can create a detailed comparison between the signal waveform used to design the cam and the measured time series from the sensor being validated. The presented simulator proved to be robust and accurate in short- and long-term use, as it produced the signal waveform consistently and accurately. To validate this solution, a 3D force sensor was used, which was proven earlier to be able to measure high-quality blood pressure waveforms on the radial artery at the wrist. The results showed high similarity between the measured and the nominal waveforms: comparing the normalized signals, the RMSE value ranged from $0.0276 \pm 0.0047$ to $0.0212 \pm 0.0023$, and the Pearson correlation ranged from $0.9933 \pm 0.0027$ to $0.9978 \pm 0.0005$. Our validation framework is available at https://github.com/repat8/cam-bpw-sim. Our hardware framework, which allows reproduction of the presented solution, is available at https://github.com/repat8/cam-bpw-sim-hardware. The entire design is an open source project and was developed using free software.
中文: 本文提出了一种基于3D打印技术的开源血压波形模拟器及Python验证工具包,通过实验证明其具有高精度(均方根误差最低达0.0212,皮尔逊相关系数最高达0.9978),能有效降低传感器研发成本。
English: This paper introduces an open-source blood pressure waveform simulator using 3D printed components and a Python validation package to reduce development costs, demonstrating high accuracy with RMSE values as low as 0.0212 and Pearson correlations up to 0.9978.
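
The two similarity figures quoted above, RMSE and Pearson correlation on normalized signals, can be computed as in the sketch below. This is not the `cam-bpw-sim` package's API. Min-max scaling is an assumed normalization (the abstract does not say which one was used), and the two short waveforms are made-up illustrative data.

```python
import math


def min_max_normalize(signal):
    """Scale a sequence to the [0, 1] range (assumed normalization)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        raise ValueError("a constant signal cannot be min-max normalized")
    return [(x - lo) / (hi - lo) for x in signal]


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up samples: the nominal waveform (cam design) and a sensor reading.
nominal = [0.0, 0.4, 1.0, 0.7, 0.5, 0.3, 0.1, 0.0]
measured = [0.02, 0.43, 0.97, 0.71, 0.48, 0.33, 0.12, 0.01]

norm_nominal = min_max_normalize(nominal)
norm_measured = min_max_normalize(measured)
print("RMSE:", rmse(norm_nominal, norm_measured))
print("Pearson:", pearson(norm_nominal, norm_measured))
```

In practice the two series would first need to be resampled and aligned in time; that step is omitted here.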

Authors:Jiaqi Hua, Wanxu Wei
Title: Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning
Abstract:
Recently, several works have been conducted on jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search, known as Improved Few-Shot Jailbreaking (I-FSJ). Nevertheless, we notice that this method may still require a long context to jailbreak advanced models, e.g., 32 shots of demos for Meta-Llama-3-8B-Instruct (Llama-3). In this paper, we discuss the limitations of I-FSJ and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern and behavior learning to exploit the model's vulnerabilities in a more generalized and efficient way. We conduct elaborate experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at https://github.com/iphosi/Self-Instruct-FSJ.
中文摘要:近期研究通过注入特殊标记和随机搜索改进了大语言模型的少样本越狱效率,但该方法仍需要较长上下文,因此提出一种将攻击分解为模式与行为学习的新框架以提升效率。
English Summary: Recent research has improved few-shot jailbreaking of LLMs by injecting special tokens and using random search, but it still requires lengthy contexts, leading to a new method that decomposes attacks into pattern and behavior learning for greater efficiency.

Authors:Thibaut Boissin, Franck Mamalet, Thomas Fel, Agustin Martin Picard, Thomas Massena, Mathieu Serrurier
Title: An Adaptive Orthogonal Convolution Scheme for Efficient and Flexible CNN Architectures
Abstract:
Orthogonal convolutional layers are valuable components in multiple areas of machine learning, such as adversarial robustness, normalizing flows, GANs, and Lipschitz-constrained models. Their ability to preserve norms and ensure stable gradient propagation makes them valuable for a large range of problems. Despite their promise, the deployment of orthogonal convolution in large-scale applications is a significant challenge due to computational overhead and limited support for modern features like strides, dilations, group convolutions, and transposed convolutions. In this paper, we introduce AOC (Adaptive Orthogonal Convolution), a scalable method that extends a previous method (BCOP), effectively overcoming existing limitations in the construction of orthogonal convolutions. This advancement unlocks the construction of architectures that were previously considered impractical. We demonstrate through our experiments that our method produces expressive models that become increasingly efficient as they scale. To foster further advancement, we provide an open-source Python package implementing this method, called Orthogonium (https://github.com/deel-ai/orthogonium).
中文: AOC是一种可扩展的方法,有效克服了正交卷积的现有局限,实现了高效且表达能力强的模型,适用于大规模应用,并提供了开源实现。
English: AOC is a scalable method that overcomes the limitations of orthogonal convolutions, enabling efficient and expressive models for large-scale applications while providing an open-source implementation.

Authors:Mohamed A. Taha
Title: Logarithmic Memory Networks (LMNs): Efficient Long-Range Sequence Modeling for Resource-Constrained Environments
Abstract:
Long-range sequence modeling is a crucial aspect of natural language processing and time series analysis. However, traditional models like Recurrent Neural Networks (RNNs) and Transformers suffer from computational and memory inefficiencies, especially when dealing with long sequences. This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that leverages a hierarchical logarithmic tree structure to efficiently store and retrieve past information. LMNs dynamically summarize historical context, significantly reducing the memory footprint and computational complexity of attention mechanisms from O(n²) to O(log(n)). The model employs a single-vector, targeted attention mechanism to access stored information, and the memory block construction worker (summarizer) layer operates in two modes: a parallel execution mode during training for efficient processing of hierarchical tree structures and a sequential execution mode during inference, which acts as a memory management system. It also implicitly encodes positional information, eliminating the need for explicit positional encodings. These features make LMNs a robust and scalable solution for processing long-range sequences in resource-constrained environments, offering practical improvements in efficiency and scalability. The code is publicly available under the MIT License on GitHub: https://github.com/AhmedBoin/LogarithmicMemory.
中文摘要:本文提出对数记忆网络(LMN),通过分层树结构和目标注意力机制将计算复杂度从O(n²)降至O(log(n)),无需显式位置编码即可实现长序列的高效处理。
English Summary: This paper introduces Logarithmic Memory Networks (LMNs), a novel architecture that reduces computational complexity from O(n²) to O(log(n)) through hierarchical tree structures and targeted attention, enabling efficient long-sequence processing without explicit positional encodings.
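
To see why a hierarchical tree keeps only O(log(n)) summaries, consider the sequential (inference-style) sketch below. It is an illustration of the general idea rather than the authors' code: equal-level neighbours are merged like carries in a binary counter, and the element-wise mean stands in for the learned summarizer layer. The names `LogMemory` and `merge` are hypothetical.

```python
def merge(older, newer):
    """Stand-in summarizer: element-wise mean of two equal-length vectors."""
    return [(x + y) / 2 for x, y in zip(older, newer)]


class LogMemory:
    """Keeps at most floor(log2(n)) + 1 summary vectors after n appends."""

    def __init__(self):
        self.slots = []  # list of (level, vector), oldest first

    def append(self, vector):
        self.slots.append((0, list(vector)))
        # Merge equal-level neighbours, like carrying in a binary counter.
        while len(self.slots) >= 2 and self.slots[-1][0] == self.slots[-2][0]:
            level, newer = self.slots.pop()
            _, older = self.slots.pop()
            self.slots.append((level + 1, merge(older, newer)))


memory = LogMemory()
for t in range(1000):
    memory.append([float(t)])

# The slot count equals the number of 1-bits in n, so it grows as O(log n).
print(len(memory.slots), [level for level, _ in memory.slots])
```

A query vector would then attend over these few slots instead of over all past tokens, which is where the saving relative to full attention comes from.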

Authors:Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt
Title: Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
Abstract:
Language model (LM) post-training relies on two stages of human supervision: task demonstrations for supervised finetuning (SFT), followed by preference comparisons for reinforcement learning from human feedback (RLHF). As LMs become more capable, the tasks they are given become harder to supervise. Will post-training remain effective under unreliable supervision? To test this, we simulate unreliable demonstrations and comparison feedback using small LMs and time-constrained humans. We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT. To address this, we propose iterative label refinement (ILR) as an alternative to RLHF. ILR improves the SFT data by using comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrains the model via SFT on the updated data. SFT+ILR outperforms SFT+DPO on several tasks with unreliable supervision (math, coding, and safe instruction-following). Our findings suggest that as LMs are used for complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback; instead, it is better to direct feedback towards improving the training data rather than continually training the model. Our code and data are available at https://github.com/helloelwin/iterative-label-refinement.
中文: 当人类监督不可靠时,监督微调(SFT)仍保持部分有效性,但直接偏好优化(DPO)无法进一步优化模型,因此提出迭代标签优化(ILR)作为更优替代方案,通过提升训练数据质量而非持续训练模型来改善性能。
English: When human supervision becomes unreliable, supervised fine-tuning (SFT) remains partially effective, but direct preference optimization (DPO) fails to enhance the model further, prompting the introduction of iterative label refinement (ILR) as a superior alternative that improves training data quality instead of continuous model training.
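
One round of the label-refinement step described above can be sketched as follows. This is a simplified illustration under stated assumptions: `generate` and `prefers_model` are hypothetical callables standing in for the current SFT model and for the (possibly unreliable) comparison annotator, and the retraining step on the refined data is omitted.

```python
def refine_labels(dataset, generate, prefers_model):
    """Return a new SFT dataset for the next round of training.

    dataset: list of (prompt, demonstration) pairs.
    generate(prompt): the current model's proposed answer (hypothetical).
    prefers_model(prompt, demonstration, proposal): True when comparison
        feedback favours the model's proposal over the demonstration.
    """
    refined = []
    for prompt, demonstration in dataset:
        proposal = generate(prompt)
        if prefers_model(prompt, demonstration, proposal):
            refined.append((prompt, proposal))       # replace the label
        else:
            refined.append((prompt, demonstration))  # keep the label
    return refined


# Toy stand-ins: a "model" that answers from a lookup table and a judge
# that prefers whichever answer matches a hidden reference.
model_answers = {"2+2": "4", "3*3": "6"}
reference = {"2+2": "4", "3*3": "9"}
data = [("2+2", "5"), ("3*3", "9")]

refined_data = refine_labels(
    data,
    generate=lambda prompt: model_answers[prompt],
    prefers_model=lambda prompt, demo, proposal: (
        proposal == reference[prompt] and demo != reference[prompt]
    ),
)
print(refined_data)
```

The wrong demonstration for `2+2` is replaced, while the correct demonstration for `3*3` survives even though the model's own proposal is wrong; the model would then be retrained via SFT on `refined_data`.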

Authors:Boye Niu, Yiliao Song, Kai Lian, Yifan Shen, Yu Yao, Kun Zhang, Tongliang Liu
Title: Flow: Modularized Agentic Workflow Automation
Abstract:
Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. An effective workflow adjustment is crucial in real-world scenarios, as the initial plan must adjust to unforeseen challenges and changing conditions in real time to ensure the efficient execution of complex tasks. In this paper, we define workflows as an activity-on-vertex (AOV) graph, which allows continuous workflow refinement by LLM agents through dynamic subtask allocation adjustment based on historical performance and previous AOVs. To further enhance framework performance, we emphasize modularity in workflow design based on evaluating parallelism and dependency complexity. With this design, our proposed multi-agent framework achieves efficient concurrent execution of subtasks, effective goal achievement, and enhanced error tolerance. Empirical results across various practical tasks demonstrate significant improvements in the efficiency of multi-agent frameworks through dynamic workflow refinement and modularization. The code is available at: https://github.com/tmllab/2025_ICLR_FLOW.
中文: 本文提出一种多智能体框架,通过LLM智能体动态优化工作流和模块化设计,显著提升了现实场景中任务执行的效率与容错能力。
English: This paper introduces a multi-agent framework that enhances automated task execution by dynamically refining workflows using LLM agents and modular design, significantly improving efficiency and error tolerance in real-world scenarios.
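
An activity-on-vertex (AOV) graph makes it easy to see which subtasks can run concurrently: each vertex is a subtask and each edge is a dependency. The sketch below groups subtasks into parallel batches with a layered topological sort (Kahn's algorithm). It is not taken from the Flow repository; the function name and the example workflow are hypothetical, and every dependency is assumed to appear as a key of the input mapping.

```python
def parallel_batches(dependencies):
    """Group subtasks into batches that can be executed concurrently.

    dependencies maps each subtask to the set of subtasks it depends on.
    """
    remaining = {task: set(deps) for task, deps in dependencies.items()}
    batches = []
    while remaining:
        # Subtasks whose dependencies are all finished are ready to run.
        ready = sorted(task for task, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected: not a valid AOV graph")
        batches.append(ready)
        for task in ready:
            del remaining[task]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


# Hypothetical workflow with two independent design subtasks.
workflow = {
    "collect_requirements": set(),
    "design_ui": {"collect_requirements"},
    "design_backend": {"collect_requirements"},
    "integrate": {"design_ui", "design_backend"},
}
batches = parallel_batches(workflow)
print(batches)
```

Refining the workflow at run time then amounts to editing the `dependencies` mapping (adding, removing, or re-linking subtasks) and recomputing the batches for the work that has not yet finished.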

Authors:Yunzhi Zhuge, Hongyu Gu, Lu Zhang, Jinqing Qi, Huchuan Lu
Title: Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation
Abstract:
In this paper, we address the challenges in unsupervised video object segmentation (UVOS) by proposing an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues. Unlike previous methods that focus solely on integrating appearance with motion or on modeling temporal relations, our method combines both aspects by integrating them within a unified framework. MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders, promoting a more complementary representation. To capture the intricate long-range contextual dynamics and information embedded within videos, a temporal transformer module is introduced, facilitating efficacious inter-frame interactions throughout a video clip. Furthermore, we employ a cascade of decoders across all feature levels to optimally exploit the derived features, aiming to generate increasingly precise segmentation masks. As a result, MTNet provides a strong and compact framework that explores both temporal and cross-modality knowledge to robustly and efficiently localize and track the primary object in various challenging scenarios. Extensive experiments across diverse benchmarks conclusively show that our method not only attains state-of-the-art performance in unsupervised video object segmentation but also delivers competitive results in video salient object detection. These findings highlight the method's robust versatility and its adeptness in adapting to a range of segmentation tasks. Source code is available at https://github.com/hy0523/MTNet.
中文: 本文提出MTNet算法,通过统一框架整合运动与时间线索进行无监督视频目标分割,能在复杂场景中准确定位主要目标,并取得领先性能。
English: This paper introduces MTNet, an efficient unsupervised video object segmentation algorithm that integrates motion and temporal cues within a unified framework to robustly localize primary objects across challenging scenarios, achieving state-of-the-art performance.

Authors:Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai
Title: Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
Abstract:
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
中文总结:提出的参数倒置图像金字塔网络通过使用更小的网络分支处理高分辨率图像来降低计算成本,同时利用跨分支特征交互保持性能,在多项视觉任务中以显著减少的计算量实现了更优表现。
English Summary: The proposed Parameter-Inverted Image Pyramid (PIIP) network processes higher-resolution images with smaller network branches to reduce computational costs while maintaining performance through cross-branch feature interactions, achieving superior results across various vision tasks with significantly lower computation.

Authors:Jiacheng Cui, Zhaoyi Li, Xiaochen Ma, Xinyue Bi, Yaxin Luo, Zhiqiang Shen
Title: Dataset Distillation via Committee Voting
Abstract:
Dataset distillation aims to synthesize a smaller, representative dataset that preserves the essential properties of the original data, enabling efficient model training with reduced computational resources. Prior work has primarily focused on improving the alignment or matching process between original and synthetic data, or on enhancing the efficiency of distilling large datasets. In this work, we introduce Committee Voting for Dataset Distillation (CV-DD), a novel and orthogonal approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets. We start by showing how to establish a strong baseline that already achieves state-of-the-art accuracy through leveraging recent advancements and thoughtful adjustments in model design and optimization processes. By integrating distributions and predictions from a committee of models while generating high-quality soft labels, our method captures a wider spectrum of data features, reduces model-specific biases and the adverse effects of distribution shifts, leading to significant improvements in generalization. This voting-based strategy not only promotes diversity and robustness within the distilled dataset but also significantly reduces overfitting, resulting in improved performance on post-eval tasks. Extensive experiments across various datasets and IPCs (images per class) demonstrate that Committee Voting leads to more reliable and adaptable distilled data compared to single/multi-model distillation methods, demonstrating its potential for efficient and accurate dataset distillation. Code is available at: https://github.com/Jiacheng8/CV-DD.
中文: 本文提出委员会投票数据集蒸馏(CV-DD)新方法,通过整合多个模型的预测分布生成高质量软标签,有效增强特征多样性、减少模型偏差,在多个数据集上显著提升泛化性能。
English: This paper introduces Committee Voting for Dataset Distillation (CV-DD), a novel method that synthesizes high-quality distilled datasets by leveraging collective model predictions to enhance feature diversity, reduce bias, and improve generalization across various datasets.

Authors:Varun Biyyala, Bharat Chanderprakash Kathuria, Jialu Li, Youshan Zhang
Title: SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Abstract:
Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available at https://github.com/custommetrics-sst/SST_CustomEvaluationMetrics.git.
中文摘要:SST-EM是一种创新的视频编辑评估框架,通过整合语义提取、目标追踪、精细化处理和时序一致性检测,全面评估视频的语义保真度与时间连贯性。
English Summary: SST-EM is a novel video editing evaluation framework that integrates semantic extraction, object tracking, refinement, and temporal consistency checks to comprehensively assess semantic fidelity and temporal smoothness.

Authors:Shiman Zhang, Lakshmikar Reddy Polamreddy, Youshan Zhang
Title: Confident Pseudo-labeled Diffusion Augmentation for Canine Cardiomegaly Detection
Abstract:
Canine cardiomegaly, marked by an enlarged heart, poses serious health risks if undetected, requiring accurate diagnostic methods. Current detection models often rely on small, poorly annotated datasets and struggle to generalize across diverse imaging conditions, limiting their real-world applicability. To address these issues, we propose a Confident Pseudo-labeled Diffusion Augmentation (CDA) model for identifying canine cardiomegaly. Our approach addresses the challenge of limited high-quality training data by employing diffusion models to generate synthetic X-ray images and annotate Vertebral Heart Score key points, thereby expanding the dataset. We also employ a pseudo-labeling strategy with Monte Carlo Dropout to select high-confidence labels, refine the synthetic dataset, and improve accuracy. Iteratively incorporating these labels enhances the model's performance, overcoming the limitations of existing approaches. Experimental results show that the CDA model outperforms traditional methods, achieving state-of-the-art accuracy in canine cardiomegaly detection. The code implementation is available at https://github.com/Shira7z/CDA.
中文: CDA模型通过扩散模型生成合成X光图像并采用置信伪标注策略,有效解决了标注数据稀缺问题,在犬心脏肥大检测准确率上实现了最优性能。
English: The CDA model introduces diffusion-based synthetic X-ray generation and confident pseudo-labeling to overcome limited annotated data, achieving state-of-the-art accuracy in detecting canine cardiomegaly.
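
The confidence-based selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `stochastic_predict` is a hypothetical stand-in for a network evaluated with dropout left active, and the `passes` and `max_std` values are assumptions chosen only for illustration.

```python
import random
import statistics


def stochastic_predict(x, rng, drop_p=0.2):
    """Hypothetical stand-in for a model run with dropout active at inference.

    Two toy 'units' are dropped independently, so repeated calls on the same
    input disagree, and they disagree more when the units are large.
    """
    units = [0.5 * x, 0.5 * x * x]
    kept = [u for u in units if rng.random() >= drop_p]
    return sum(kept) / (1.0 - drop_p)  # inverted-dropout rescaling


def mc_dropout_select(inputs, passes=50, max_std=0.1, seed=0):
    """Keep (input, mean prediction) pairs whose predictions are stable.

    Stability is measured as the standard deviation over `passes`
    stochastic forward passes; `max_std` is an assumed threshold.
    """
    rng = random.Random(seed)
    selected = []
    for x in inputs:
        preds = [stochastic_predict(x, rng, ) for _ in range(passes)]
        mean = statistics.fmean(preds)
        std = statistics.pstdev(preds)
        if std <= max_std:
            selected.append((x, mean))  # accepted as a pseudo-label
    return selected


if __name__ == "__main__":
    for x, label in mc_dropout_select([0.1, 5.0]):
        print(f"input={x} pseudo-label={label:.4f}")
```

In this toy setting the small input yields nearly identical predictions across passes and is kept, while the large input produces widely varying predictions and is rejected.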

Authors:Difei Gu, Yunhe Gao, Yang Zhou, Mu Zhou, Dimitris Metaxas
Title: RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment
Abstract:
Automated chest radiograph interpretation requires both accurate disease classification and detailed radiology report generation, presenting a significant challenge in the clinical workflow. Current approaches either focus on classification accuracy at the expense of interpretability or generate detailed but potentially unreliable reports through image captioning techniques. In this study, we present RadAlign, a novel framework that combines the predictive accuracy of vision-language models (VLMs) with the reasoning capabilities of large language models (LLMs). Inspired by the radiologist's workflow, RadAlign first employs a specialized VLM to align visual features with key medical concepts, achieving superior disease classification with an average AUC of 0.885 across multiple diseases. These recognized medical conditions, represented as text-based concepts in the aligned visual-language space, are then used to prompt LLM-based report generation. Enhanced by a retrieval-augmented generation mechanism that grounds outputs in similar historical cases, RadAlign delivers superior report quality with a GREEN score of 0.678, outperforming state-of-the-art methods' 0.634. Our framework maintains strong clinical interpretability while reducing hallucinations, advancing automated medical imaging and report analysis through integrated predictive and generative AI. Code is available at https://github.com/difeigu/RadAlign.
Chinese: RadAlign是一种创新框架,通过结合视觉语言模型进行精确疾病分类和大型语言模型生成详细放射学报告,以0.885的平均AUC和0.678的GREEN评分实现卓越性能,同时提升临床可解释性。
English: RadAlign is a novel framework that integrates vision-language models for accurate disease classification and large language models for generating detailed radiology reports, achieving superior performance with an AUC of 0.885 and a GREEN score of 0.678 while enhancing clinical interpretability.

Authors:Yaqing Ding, Viktor Kocur, Zuzana Berger Haladová, Qianliang Wu, Shen Cai, Jian Yang, Zuzana Kukelova
Title: Three-view Focal Length Recovery From Homographies
Abstract:
In this paper, we propose a novel approach for recovering focal lengths from three-view homographies. By examining the consistency of normal vectors between two homographies, we derive new explicit constraints between the focal lengths and homographies using an elimination technique. We demonstrate that three-view homographies provide two additional constraints, enabling the recovery of one or two focal lengths. We discuss four possible cases, including three cameras having an unknown equal focal length, three cameras having two different unknown focal lengths, three cameras where one focal length is known, and the other two cameras have equal or different unknown focal lengths. All the problems can be converted into solving polynomials in one or two unknowns, which can be efficiently solved using Sturm sequence or hidden variable technique. Evaluation using both synthetic and real data shows that the proposed solvers are both faster and more accurate than methods relying on existing two-view solvers. The code and data are available on https://github.com/kocurvik/hf
中文: 本文提出了一种通过分析三视图单应矩阵中法向量一致性并利用消元技术推导新约束的新方法,能够高效求解多项式以恢复焦距,实验证明其比现有方法更快更精确。
English: This paper introduces a novel method for recovering focal lengths from three-view homographies by deriving explicit constraints through normal vector consistency and elimination techniques, enabling efficient polynomial solving with demonstrated superior speed and accuracy over existing methods.

Authors:Wenping Jin, Li Zhu, Jing Sun
Title: Aligning First, Then Fusing: A Novel Weakly Supervised Multimodal Violence Detection Method
Abstract:
Weakly supervised violence detection refers to the technique of training models to identify violent segments in videos using only video-level labels. Among these approaches, multimodal violence detection, which integrates modalities such as audio and optical flow, holds great potential. Existing methods in this domain primarily focus on designing multimodal fusion models to address modality discrepancies. In contrast, we take a different approach: we leverage the inherent discrepancies across modalities in violence event representation to propose a novel multimodal semantic feature alignment method. This method sparsely maps the semantic features of local, transient, and less informative modalities (such as audio and optical flow) into the more informative RGB semantic feature space. Through an iterative process, the method identifies a suitable non-zero feature matching subspace and aligns the modality-specific event representations based on this subspace, enabling the full exploitation of information from all modalities during the subsequent modality fusion stage. Building on this, we design a new weakly supervised violence detection framework that consists of unimodal multiple-instance learning for extracting unimodal semantic features, multimodal alignment, multimodal fusion, and final detection. Experimental results on benchmark datasets demonstrate the effectiveness of our method, achieving an average precision (AP) of 86.07% on the XD-Violence dataset. Our code is available at https://github.com/xjpp2016/MAVD.
Chinese Summary: 本文提出一种新颖的弱监督暴力检测方法,通过将信息量较少的模态特征稀疏映射至RGB语义空间实现多模态特征对齐,结合单模态学习、特征对齐、多模态融合与检测的框架,在XD-Violence数据集上达到86.07%的平均精度。
English Summary: This paper introduces a novel weakly supervised violence detection method that aligns multimodal semantic features by mapping less informative modalities into the RGB feature space, achieving 86.07% AP on the XD-Violence dataset through a framework combining unimodal learning, alignment, fusion, and detection.

Authors:Denis Lochmelis, Evgenii Moiseenko, Yaroslav Golubev, Anton Podkopaev
Title: LitmusKt: Concurrency Stress Testing for Kotlin
Abstract:
We present LitmusKt - the first tool for litmus testing concurrent programs in Kotlin. The tool's novelty also lies in the fact that Kotlin is a multiplatform language, i.e., it compiles to multiple platforms, which means that concurrency has to be tested on several of them. Our tool allows writing litmus tests in a single custom DSL, and these tests are then run in Kotlin/Native and Kotlin/JVM, the two main platforms for concurrent programming in Kotlin. Using LitmusKt, we discovered novel bugs in the Kotlin compiler, which have since been fixed. Moreover, LitmusKt was integrated into the CI pipeline for Kotlin. LitmusKt is available on GitHub: https://github.com/JetBrains-Research/litmuskt. The demo is available on YouTube: https://youtu.be/oWCZp_Huwss.
中文: LitmusKt是首个用于Kotlin并发程序litmus测试的工具,它允许通过自定义DSL编写测试并在Kotlin/Native和Kotlin/JVM平台上运行,已成功发现并修复了Kotlin编译器中的新错误,并被集成到持续集成流程中。
English: LitmusKt is the first tool designed for litmus testing of concurrent programs in Kotlin, enabling tests to be written in a custom DSL and executed across Kotlin/Native and Kotlin/JVM platforms, which has led to the discovery and resolution of bugs in the Kotlin compiler and its integration into the CI pipeline.

Authors:Fabio Montello, Ronja Güldenring, Simone Scardapane, Lazaros Nalpantidis
Title: A Survey on Dynamic Neural Networks: from Computer Vision to Multi-modal Sensor Fusion
Abstract:
Model compression is essential in the deployment of large Computer Vision models on embedded devices. However, static optimization techniques (e.g., pruning and quantization) neglect the fact that different inputs have different complexities and thus require different amounts of computation. Dynamic Neural Networks allow the amount of computation to be conditioned on the specific input. The current literature on the topic is very extensive and fragmented. We present a comprehensive survey that synthesizes and unifies existing Dynamic Neural Networks research in the context of Computer Vision. Additionally, we provide a logical taxonomy based on which component of the network is adaptive: the output, the computation graph or the input. Furthermore, we argue that Dynamic Neural Networks are particularly beneficial in the context of Sensor Fusion for better adaptivity, noise reduction and information prioritization. We present preliminary works in this direction. We complement this survey with a curated repository listing all the surveyed papers, each with a brief summary of the solution and the code base when available: https://github.com/DTU-PAS/awesome-dynn-for-cv .
中文摘要:模型压缩对于在嵌入式设备上部署大型计算机视觉模型至关重要,而动态神经网络能根据输入复杂度调整计算量,本综述提出统一分类法并强调其在传感器融合中的优势。
English summary: Model compression is crucial for deploying large computer vision models on embedded devices, and dynamic neural networks adapt computational complexity based on input requirements, with this survey providing a unified taxonomy and highlighting their benefits in sensor fusion.
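
One common way to make the amount of computation depend on the input is early exiting, where an intermediate classifier stops the forward pass once it is confident enough. The sketch below is only an illustration of that general idea, not a method taken from the survey: the `toy_stages` network, the `threshold` value, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, stages, threshold=0.9):
    """Run stages in order and stop at the first confident exit head.

    stages: list of (transform, head) pairs; `transform` updates the hidden
    value and `head` maps it to class logits. Returns (class, stages_used).
    The last stage always produces an answer.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    h = x
    for used, (transform, head) in enumerate(stages, start=1):
        h = transform(h)
        probs = softmax(head(h))
        if max(probs) >= threshold or used == len(stages):
            return probs.index(max(probs)), used


# Hypothetical three-stage network: each stage doubles the hidden value,
# and each exit head scores two classes as [h, -h].
toy_stages = [(lambda h: 2.0 * h, lambda h: [h, -h]) for _ in range(3)]

if __name__ == "__main__":
    print(early_exit_predict(2.0, toy_stages))  # easy input, exits early
    print(early_exit_predict(0.1, toy_stages))  # hard input, uses all stages
```

An input that is classified confidently at the first head uses one stage, while an ambiguous input falls through to the final stage, so the cost varies per input.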

Authors:Brendan Mallery, James M. Murphy, Shuchin Aeron
Title: Synthesis and Analysis of Data as Probability Measures with Entropy-Regularized Optimal Transport
Abstract:
We consider synthesis and analysis of probability measures using the entropy-regularized Wasserstein-2 cost and its unbiased version, the Sinkhorn divergence. The synthesis problem consists of computing the barycenter, with respect to these costs, of reference measures given a set of coefficients belonging to the simplex. The analysis problem consists of finding the coefficients for the closest barycenter in the Wasserstein-2 distance to a given measure. Under the weakest assumptions on the measures thus far in the literature, we compute the derivative of the entropy-regularized Wasserstein-2 cost. We leverage this to establish a characterization of barycenters with respect to the entropy-regularized Wasserstein-2 cost as solutions that correspond to a fixed point of an average of the entropy-regularized displacement maps. This characterization yields a finite-dimensional, convex, quadratic program for solving the analysis problem when the measure being analyzed is a barycenter with respect to the entropy-regularized Wasserstein-2 cost. We show that these coefficients, as well as the value of the barycenter functional, can be estimated from samples with dimension-independent rates of convergence, and that barycentric coefficients are stable with respect to perturbations in the Wasserstein-2 metric. We employ the barycentric coefficients as features for classification of corrupted point cloud data, and show that compared to neural network baselines, our approach is more efficient in small training data regimes.
中文: 本研究利用熵正则化Wasserstein-2成本与Sinkhorn散度进行概率测度的综合与分析,建立了重心特征描述,并在小样本场景下展示了优于神经网络的高效分类应用。
English: This study explores the synthesis and analysis of probability measures using entropy-regularized Wasserstein-2 cost and Sinkhorn divergence, establishing barycenter characterizations and demonstrating efficient classification applications with superior performance in data-scarce scenarios.

Authors:Daniel Steininger, Julia Simon, Andreas Trondl, Markus Murschitz
Title: TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations
Abstract:
Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling and measuring logs still require substantial human labor in remote environments posing significant safety risks. Progressively automating these tasks has the potential of increasing their efficiency as well as safety, but requires an accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can be readily combined with other sensor modalities.
中文: TimberVision数据集通过提供包含5.1万个树干部件的2000多张标注图像,解决了林业自动化中专业数据匮乏的问题,利用先进的计算机视觉技术实现了对原木和树木的精准检测与追踪。
English: The TimberVision dataset addresses the scarcity of specialized data for automating forestry tasks by providing over 2,000 annotated images with 51,000 trunk components, enabling robust detection and tracking of logs and trees through advanced computer vision techniques.

Authors:Haochuan Zhang, Chunhua Yang, Jie Han, Liyang Qin, Xiaoli Wang
Title: TempoGPT: Enhancing Time Series Reasoning via Quantizing Embedding
Abstract:
Multi-modal language models have made significant progress in vision and audio, but still face significant challenges in dealing with complex reasoning tasks in the time series domain. The reasons are twofold. First, labels for multi-modal time series data are coarse and devoid of analysis or reasoning processes. Training with these data cannot improve the model's reasoning capabilities. Second, due to the lack of precise tokenization in processing time series, the representation patterns for temporal and textual information are inconsistent, which hampers the effectiveness of multi-modal alignment. To address these challenges, we propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. Specifically, we construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Additionally, the proposed TempoGPT achieves consistent representation between temporal and textual information by quantizing temporal embeddings: temporal embeddings are quantized into a series of discrete tokens using a predefined codebook, and a shared embedding layer then processes both temporal and textual tokens. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art performance on the constructed complex time series reasoning tasks. Moreover, we quantitatively demonstrate the effectiveness of quantizing temporal embeddings in enhancing multi-modal alignment and the reasoning capabilities of TLMs. Code and data are available at https://github.com/zhanghaochuan20/TempoGPT.
中文: 多模态语言模型在时间序列复杂推理中面临标注粗糙和表征不一致的挑战,而提出的TempoGPT通过构建白盒系统数据和时间嵌入量化,实现了时序与文本信息对齐,显著提升了推理性能。
English: Multi-modal language models struggle with complex time series reasoning due to coarse labels and inconsistent temporal-textual representations, but the proposed TempoGPT addresses these by constructing specialized data and quantizing temporal embeddings for improved alignment and performance.
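
The quantization step described in the abstract, mapping temporal embeddings to discrete tokens through a predefined codebook, can be sketched as a nearest-neighbor lookup. This is a minimal illustration under stated assumptions, not the TempoGPT code: the squared Euclidean distance, the toy codebook, and the id-offset scheme in `to_shared_ids` are all hypothetical choices.

```python
def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codebook vector.

    Nearest is measured by squared Euclidean distance (an assumption; the
    abstract does not state the distance used).
    """
    tokens = []
    for e in embeddings:
        dists = [sum((ei - ci) ** 2 for ei, ci in zip(e, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def to_shared_ids(temporal_tokens, text_vocab_size):
    """Place temporal tokens after the text vocabulary in one id space.

    Offsetting is one hypothetical way to let a single embedding table
    serve both textual and temporal tokens.
    """
    return [text_vocab_size + t for t in temporal_tokens]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]      # toy codebook
    embeddings = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]   # toy embeddings
    tokens = quantize(embeddings, codebook)
    print(tokens)
    print(to_shared_ids(tokens, text_vocab_size=100))
```

Once both modalities are expressed as integer ids, the same embedding lookup can be applied to a mixed sequence of text and time series tokens.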

Authors:A. Erkhov, A. Bazhenov, S. Satsevich, D. Belov, F. Khabibullin, S. Egorov, M. Gromakov, M. Altamirano Cabrera, D. Tsetserukou
Title: ViewVR: Visual Feedback Modes to Achieve Quality of VR-based Telemanipulation
Abstract:
The paper focuses on an immersive teleoperation system that enhances the operator's ability to actively perceive the robot's surroundings. A consumer-grade HTC Vive VR system was used to synchronize the operator's hand and head movements with a UR3 robot and a custom-built robotic head with two degrees of freedom (2-DoF). The system's usability, manipulation efficiency, and intuitiveness of control were evaluated in comparison with static head camera positioning across three distinct tasks. Code and other supplementary materials are available at https://github.com/ErkhovArtem/ViewVR
中文: 本文介绍了一种沉浸式遥操作系统,利用HTC Vive VR设备将人体动作与机器人同步,相比静态摄像头方法,显著提升了环境感知能力和操控效率。
English: This paper presents an immersive teleoperation system using an HTC Vive VR setup to synchronize human movements with a robot, improving perception and control efficiency compared to static camera methods.

Authors:Zhimeng Xin, Tianxu Wu, Shiming Chen, Shuo Ye, Zijing Xie, Yixiong Zou, Xinge You, Yufei Guo
Title: Toward Realistic Camouflaged Object Detection: Benchmarks and Method
Abstract:
Camouflaged object detection (COD) primarily relies on semantic or instance segmentation methods. While these methods have made significant advancements in identifying the contours of camouflaged objects, they may be inefficient or not cost-effective for tasks that only require the specific location of the object. Object detection algorithms offer an optimized solution for Realistic Camouflaged Object Detection (RCOD) in such cases. However, detecting camouflaged objects remains a formidable challenge due to the high degree of similarity between the features of the objects and their backgrounds. Unlike segmentation methods that perform pixel-wise comparisons to differentiate between foreground and background, object detectors omit this analysis, further aggravating the challenge. To solve this problem, we propose a camouflage-aware feature refinement (CAFR) strategy. Since camouflaged objects are not rare categories, CAFR fully utilizes a clear perception of the current object within the prior knowledge of large models to assist detectors in deeply understanding the distinctions between background and foreground. Specifically, in CAFR, we introduce the Adaptive Gradient Propagation (AGP) module that fine-tunes all feature extractor layers in large detection models to fully refine class-specific features from camouflaged contexts. We then design the Sparse Feature Refinement (SFR) module that optimizes the transformer-based feature extractor to focus primarily on capturing class-specific features in camouflaged scenarios. To facilitate the assessment of RCOD tasks, we manually annotate the labels required for detection on three existing segmentation COD datasets, creating a new benchmark for RCOD tasks. Code and datasets are available at: https://github.com/zhimengXin/RCOD.
中文: 提出的伪装感知特征细化(CAFR)策略利用大模型的先验知识优化特征提取,有效区分伪装物体与背景,解决了分割方法在定位任务中的不足,并为现实伪装物体检测建立了新基准。
English: The proposed camouflage-aware feature refinement (CAFR) strategy enhances object detection for camouflaged objects by leveraging large models' prior knowledge and refining class-specific features, addressing the limitations of segmentation methods and establishing a new benchmark for realistic camouflaged object detection.

Authors:Junhao Zheng, Chengming Shi, Xidi Cai, Qiuke Li, Duzhen Zhang, Chenxing Li, Dong Yu, Qianli Ma
Title: Lifelong Learning of Large Language Model based Agents: A Roadmap
Abstract:
Lifelong learning, also known as continual or incremental learning, is a crucial component for advancing Artificial General Intelligence (AGI) by enabling systems to continuously adapt in dynamic environments. While large language models (LLMs) have demonstrated impressive capabilities in natural language processing, existing LLM agents are typically designed for static systems and lack the ability to adapt over time in response to new challenges. This survey is the first to systematically summarize the potential techniques for incorporating lifelong learning into LLM-based agents. We categorize the core components of these agents into three modules: the perception module for multimodal input integration, the memory module for storing and retrieving evolving knowledge, and the action module for grounded interactions with the dynamic environment. We highlight how these pillars collectively enable continuous adaptation, mitigate catastrophic forgetting, and improve long-term performance. This survey provides a roadmap for researchers and practitioners working to develop lifelong learning capabilities in LLM agents, offering insights into emerging trends, evaluation metrics, and application scenarios. Relevant literature and resources are available at https://github.com/qianlima-lab/awesome-lifelong-llm-agent.
中文: 本综述首次系统性地总结了将终身学习融入基于大语言模型的智能体的技术,通过感知、记忆和行动三大模块实现动态环境中的持续适应,同时防止灾难性遗忘。
English: This survey systematically outlines techniques for integrating lifelong learning into LLM-based agents through perception, memory, and action modules to enable continuous adaptation in dynamic environments while mitigating catastrophic forgetting.

Authors:Wenyan Xu, Jiayu Chen, Dawei Xiang, Chen Li, Yonghong Hu, Zhonghua Lu
Title: Mining Intraday Risk Factor Collections via Hierarchical Reinforcement Learning based on Transferred Options
Abstract:
Traditional risk factors like beta, size/value, and momentum often lag behind market dynamics in measuring and predicting stock return volatility. Statistical models like PCA and factor analysis fail to capture hidden nonlinear relationships. Genetic programming (GP) can identify nonlinear factors but often lacks mechanisms for evaluating factor quality, and the resulting formulas are complex. To address these challenges, we propose a Hierarchical Proximal Policy Optimization (HPPO) framework for automated factor generation and evaluation. HPPO uses two PPO models: a high-level policy assigns weights to stock features, and a low-level policy identifies latent nonlinear relationships. The Pearson correlation between generated factors and return volatility serves as the reward signal. Transfer learning pre-trains the high-level policy on large-scale historical data, fine-tuning it with the latest data to adapt to new features and shifts. Experiments show the HPPO-TO algorithm achieves a 25% excess return in HFT markets across China (CSI 300/800), India (Nifty 100), and the US (S&P 500). Code and data are available at https://github.com/wencyxu/HRL-HF_risk_factor_set.
中文摘要:提出的分层近端策略优化(HPPO)框架通过自动生成和评估非线性风险因子,克服了传统方法和统计模型的局限,在多个国家的高频交易市场中实现了25%的超额收益。
English Summary: The proposed Hierarchical Proximal Policy Optimization (HPPO) framework overcomes limitations of traditional and statistical methods by automatically generating and evaluating nonlinear risk factors, achieving a 25% excess return in high-frequency trading markets across multiple countries.
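
The reward signal named in the abstract, the Pearson correlation between a generated factor and return volatility, can be computed as in the sketch below. It only illustrates the correlation itself; how the authors window the data or post-process the value is not stated in the abstract, and the function name `pearson_reward` and the zero-variance convention are assumptions.

```python
import math


def pearson_reward(factor_values, volatilities):
    """Pearson correlation between a factor series and a volatility series.

    Returns 0.0 when either series is constant (an assumed convention,
    since the correlation is undefined in that case).
    """
    n = len(factor_values)
    if n != len(volatilities) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_f = sum(factor_values) / n
    mean_v = sum(volatilities) / n
    cov = sum((f - mean_f) * (v - mean_v)
              for f, v in zip(factor_values, volatilities))
    var_f = sum((f - mean_f) ** 2 for f in factor_values)
    var_v = sum((v - mean_v) ** 2 for v in volatilities)
    if var_f == 0.0 or var_v == 0.0:
        return 0.0
    return cov / math.sqrt(var_f * var_v)


if __name__ == "__main__":
    factor = [1.0, 2.0, 3.0]           # toy factor values
    volatility = [2.0, 4.0, 6.0]       # toy realized volatilities
    print(pearson_reward(factor, volatility))
```

A factor that moves in lockstep with volatility scores close to 1, an unrelated factor scores near 0, and an inversely related factor scores close to -1.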

Authors:Li Liang, Naveed Akhtar, Jordan Vice, Xiangrui Kong, Ajmal Saeed Mian
Title: Skip Mamba Diffusion for Monocular 3D Semantic Scene Completion
Abstract:
3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to the challenging real-world conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model, leveraging advances from the state space and diffusion generative modeling to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI360 datasets shows that our approach not only outperforms other monocular techniques by a large margin, but also achieves competitive performance against stereo methods. The code is available at https://github.com/xrkong/skimba
中文摘要:本文提出了一种新颖的神经网络模型,利用状态空间和扩散生成建模技术,仅通过单目图像输入即可实现卓越的3D语义场景补全效果,大幅超越现有单目方法并与立体方法相媲美。
English Summary: This paper introduces a novel neural model that employs state space and diffusion generative modeling to achieve superior 3D semantic scene completion from monocular images, significantly outperforming existing monocular methods and competing with stereo approaches.

Authors:Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran
Title: EdgeTAM: On-Device Track Anything Model
Abstract:
Building on the Segment Anything Model (SAM), SAM 2 extends its capability from image to video inputs through a memory bank mechanism and obtains remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim to make SAM 2 much more efficient so that it can run even on mobile devices while maintaining comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Given that video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves the performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.
Chinese: SAM 2通过内存库机制将SAM扩展至视频分割,而EdgeTAM采用新型2D空间感知器和蒸馏技术优化其效率,使其在iPhone 15 Pro Max上以16 FPS运行并保持高性能。
English: SAM 2 enhances video segmentation by extending SAM with a memory bank mechanism, and EdgeTAM optimizes its efficiency for mobile devices using a 2D Spatial Perceiver and distillation, achieving high performance at 16 FPS on iPhone 15 Pro Max.

Authors:Jie Tan, Yu Rong, Kangfei Zhao, Tian Bian, Tingyang Xu, Junzhou Huang, Hong Cheng, Helen Meng
Title: Natural Language-Assisted Multi-modal Medication Recommendation
Abstract:
Combinatorial medication recommendation (CMR) is a fundamental task of healthcare, which offers opportunities for clinical physicians to provide more precise prescriptions for patients with intricate health conditions, particularly in the scenarios of long-term medical care. Previous research efforts have sought to extract meaningful information from electronic health records (EHRs) to facilitate combinatorial medication recommendations. Existing learning-based approaches further consider the chemical structures of medications, but ignore the textual medication descriptions in which the functionalities are clearly described. Furthermore, the textual knowledge derived from the EHRs of patients remains largely underutilized. To address these issues, we introduce the Natural Language-Assisted Multi-modal Medication Recommendation (NLA-MMR), a multi-modal alignment framework designed to learn knowledge from the patient view and medication view jointly. Specifically, NLA-MMR formulates CMR as an alignment problem from patient and medication modalities. In this vein, we employ pretrained language models (PLMs) to extract in-domain knowledge regarding patients and medications, serving as the foundational representation for both modalities. In the medication modality, we exploit both chemical structures and textual descriptions to create medication representations. In the patient modality, we generate the patient representations based on textual descriptions of diagnosis, procedure, and symptom. Extensive experiments conducted on three publicly accessible datasets demonstrate that NLA-MMR achieves new state-of-the-art performance, with a notable average improvement of 4.72% in Jaccard score. Our source code is publicly available on https://github.com/jtan1102/NLA-MMR_CIKM_2024.
中文摘要:本研究提出的NLA-MMR多模态框架通过预训练语言模型联合学习患者与药物知识,利用药物化学结构和文本描述增强组合用药推荐效果,在三个公开数据集上实现了4.72%的杰卡德指数平均提升。
English Summary: This study introduces NLA-MMR, a multi-modal framework that enhances combinatorial medication recommendations by aligning patient and medication knowledge through pretrained language models, achieving state-of-the-art performance with a 4.72% average improvement in Jaccard score.
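
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of a Jaccard score between a recommended and a prescribed medication set. The function name, the drug names, and the empty-set convention are illustrative assumptions; how NLA-MMR averages the score across patients and visits is not stated in this entry.

```python
def jaccard(predicted, actual):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two medication sets."""
    predicted, actual = set(predicted), set(actual)
    union = predicted | actual
    if not union:
        # Assumed convention: two empty sets count as a perfect match.
        return 1.0
    return len(predicted & actual) / len(union)


# Hypothetical example: 2 shared drugs out of 4 distinct drugs overall.
recommended = {"metformin", "lisinopril", "atorvastatin"}
prescribed = {"metformin", "atorvastatin", "aspirin"}
score = jaccard(recommended, prescribed)
```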

Authors:Jinlin Li, Xiao Zhou
Title: CureGraph: Contrastive Multi-Modal Graph Representation Learning for Urban Living Circle Health Profiling and Prediction
Abstract:
The early detection and prediction of health status decline among the elderly at the neighborhood level are of great significance for urban planning and public health policymaking. While existing studies affirm the connection between living environments and health outcomes, most rely on single data modalities or simplistic feature concatenation of multi-modal information, limiting their ability to comprehensively profile the health-oriented urban environments. To fill this gap, we propose CureGraph, a contrastive multi-modal representation learning framework for urban health prediction that employs graph-based techniques to infer the prevalence of common chronic diseases among the elderly within the urban living circles of each neighborhood. CureGraph leverages rich multi-modal information, including photos and textual reviews of residential areas and their surrounding points of interest, to generate urban neighborhood embeddings. By integrating pre-trained visual and textual encoders with graph modeling techniques, CureGraph captures cross-modal spatial dependencies, offering a comprehensive understanding of urban environments tailored to elderly health considerations. Extensive experiments on real-world datasets demonstrate that CureGraph improves the best baseline by $28\%$ on average in terms of $R^2$ across elderly disease risk prediction tasks. Moreover, the model enables the identification of stage-wise chronic disease progression and supports comparative public health analysis across neighborhoods, offering actionable insights for sustainable urban development and enhanced quality of life. The code is publicly available at https://github.com/jinlin2021/CureGraph.
Chinese: CureGraph是一种基于图技术的多模态对比学习框架,通过整合视觉与文本数据预测社区老年人慢性病风险,实验显示其预测性能比最佳基线平均提升28%,为城市健康规划提供有效参考。
English: CureGraph is a contrastive multi-modal framework that uses graph-based techniques and multi-modal data to predict elderly chronic disease risks at the neighborhood level, improving baseline predictions by 28% and providing insights for urban health planning.

Authors:Csaba Tóth, Danilo Jr Dela Cruz, Harald Oberhauser
Title: A User's Guide to $\texttt{KSig}$: GPU-Accelerated Computation of the Signature Kernel
Abstract:
The signature kernel is a positive definite kernel for sequential and temporal data that has become increasingly popular in machine learning applications due to powerful theoretical guarantees, strong empirical performance, and recently introduced various scalable variations. In this chapter, we give a short introduction to $\texttt{KSig}$, a $\texttt{Scikit-Learn}$ compatible Python package that implements various GPU-accelerated algorithms for computing signature kernels, and performing downstream learning tasks. We also introduce a new algorithm based on tensor sketches which gives strong performance compared to existing algorithms. The package is available at https://github.com/tgcsaba/ksig.
中文: 签名核是处理序列数据的高效工具,KSig软件包提供了GPU加速的签名核计算和学习任务实现,其中包含基于张量素描的新算法,性能优异。
English: The signature kernel is a highly effective tool for sequential data analysis, and the KSig package offers GPU-accelerated implementations for computing these kernels and performing learning tasks, including a new tensor sketch-based algorithm with strong performance.

Authors:Han Liu, Yinwei Wei, Fan Liu, Wenjie Wang, Liqiang Nie, Tat-Seng Chua
Title: Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation
Abstract:
Multimodal information (e.g., visual, acoustic, and textual) has been widely used to enhance representation learning for micro-video recommendation. For integrating multimodal information into a joint representation of micro-video, multimodal fusion plays a vital role in the existing micro-video recommendation approaches. However, the static multimodal fusion used in previous studies is insufficient to model the various relationships among multimodal information of different micro-videos. In this paper, we develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF), which dynamically assigns parameters to the multimodal fusion function for each micro-video during its representation learning. Specifically, MetaMMF regards the multimodal fusion of each micro-video as an independent task. Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models, like MMGCN, LATTICE, and InvRL. Furthermore, we lighten our model by adopting canonical polyadic decomposition to improve the training efficiency, and validate its effectiveness through experimental results. Codes are available at https://github.com/hanliu95/MetaMMF.
中文: 本文提出MetaMMF框架,通过元学习动态融合各微视频的多模态信息以提升推荐性能,在超越现有先进模型的同时保持了训练效率。
English: This paper introduces MetaMMF, a dynamic meta-learning framework that adaptively fuses multimodal information for each micro-video to improve recommendation accuracy, outperforming existing models while maintaining training efficiency.

Authors:Jason Du, Kelly Hong, Alishba Imran, Erfan Jahanparast, Mehdi Khfifi, Kaichun Qiao
Title: How GPT learns layer by layer
Abstract:
Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long-term planning. We study the progression of linear probe accuracy and tile color using both SAEs and linear probes to compare their effectiveness at capturing what the model is learning. Although we begin with a smaller language model, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: https://github.com/ALT-JS/OthelloSAE.
中文摘要:大语言模型在构建可泛化的内部表征方面存在不足,但通过分析OthelloGPT发现,其层级递进结构和稀疏自编码器能有效解码棋盘状态与棋子稳定性等复杂游戏特征。
English Summary: Large Language Models struggle with building generalizable internal representations for adaptive decision-making, but analyzing OthelloGPT reveals how layer-wise progression and Sparse Autoencoders can decode meaningful gameplay features like tile stability.

Authors:Sieun Hyeon, Kyudan Jung, Nam-Joon Kim, Hyun Gon Ryu, Jaeyoung Do
Title: MathReader : Text-to-Speech for Mathematical Documents
Abstract:
TTS (Text-to-Speech) document readers from Microsoft, Adobe, Apple, and OpenAI are in service worldwide. They provide relatively good TTS results for general plain text, but sometimes skip content or provide unsatisfactory results for mathematical expressions. This is because most modern academic papers are written in LaTeX, and when LaTeX formulas are compiled, they are rendered as distinctive text forms within the document. However, traditional TTS document readers output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS. MathReader demonstrated a lower Word Error Rate (WER) than existing TTS document readers, such as Microsoft Edge and Adobe Acrobat, when processing documents containing mathematical formulas. MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat. This will significantly contribute to alleviating the inconvenience faced by users who want to listen to documents, especially those who are visually impaired. The code is available at https://github.com/hyeonsieun/MathReader.
中文: 现有TTS文档阅读器在处理含LaTeX数学公式的学术论文时效果不佳,因此提出的MathReader系统整合OCR、优化T5模型和语音合成技术,显著降低误读率并提升视障用户的文档聆听体验。
English: Current TTS document readers often fail to accurately process mathematical formulas in LaTeX-based academic papers, so the proposed MathReader system combines OCR, a fine-tuned T5 model, and TTS to significantly reduce word error rates and improve accessibility for visually impaired users.
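
The WER figures above can be read against the standard definition: word-level edit distance divided by the number of reference words. The sketch below implements that textbook definition only; the function name and the example sentences are hypothetical, and MathReader's own evaluation script and text normalization are not described in this entry.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four reference words.
wer = word_error_rate("x squared plus one", "x two plus one")
```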

Authors:Yongyu Mu, Hengyu Li, Junxin Wang, Xiaoxuan Zhou, Chenglong Wang, Yingfeng Luo, Qiaozhi He, Tong Xiao, Guocheng Chen, Jingbo Zhu
Title: Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models
Abstract:
Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods. Our code and parallel multilingual data can be found at https://github.com/takagi97/PMT2I.
Chinese: 本研究提出PMT2I方法,通过为大型多模态模型提供平行多语言提示来增强文本到图像的生成,在通用、组合和细粒度评估中表现优异,并能生成更多样化的图像。
English: This study introduces PMT2I, a method that enhances text-to-image generation by providing parallel multilingual prompts to large multimodal models, achieving superior performance in general, compositional, and fine-grained assessments while generating more diverse images.

Authors:Jiayang Wu, Wensheng Gan, Jiahao Zhang, Philip S. Yu
Title: ADKGD: Anomaly Detection in Knowledge Graphs with Dual-Channel Training
Abstract:
In the current development of large language models (LLMs), it is important to ensure the accuracy and reliability of the underlying data sources. LLMs are critical for various applications, but they often suffer from hallucinations and inaccuracies due to knowledge gaps in the training data. Knowledge graphs (KGs), as a powerful structural tool, could serve as a vital external information source to mitigate the aforementioned issues. By providing a structured and comprehensive understanding of real-world data, KGs enhance the performance and reliability of LLMs. However, errors are commonly introduced into KGs when triplets are extracted from unstructured data to construct them. This could lead to degraded performance in downstream tasks such as question-answering and recommender systems. Therefore, anomaly detection in KGs is essential to identify and correct these errors. This paper presents an anomaly detection algorithm in knowledge graphs with dual-channel learning (ADKGD). ADKGD leverages a dual-channel learning approach to enhance representation learning from both the entity-view and triplet-view perspectives. Furthermore, using a cross-layer approach, our framework integrates internal information aggregation and context information aggregation. We introduce a Kullback-Leibler (KL) loss component to improve the accuracy of the scoring function between the dual channels. To evaluate ADKGD's performance, we conduct empirical studies on three real-world KGs: WN18RR, FB15K, and NELL-995. Experimental results demonstrate that ADKGD outperforms the state-of-the-art anomaly detection algorithms. The source code and datasets are publicly available at https://github.com/csjywu1/ADKGD.
Chinese: 本文提出ADKGD,一种双通道学习的知识图谱异常检测算法,旨在提高知识图谱准确性,从而增强大型语言模型的可靠性。
English: This paper introduces ADKGD, a dual-channel learning algorithm for detecting anomalies in knowledge graphs to improve their accuracy and enhance the reliability of large language models.

Authors:Zhen Xiong, Yuqi Li, Chuanguang Yang, Tiao Tan, Zhihong Zhu, Siyuan Li, Yue Ma
Title: Enhancing Image Generation Fidelity via Progressive Prompts
Abstract:
The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.
中文: 本文提出了一种从粗到细的生成流程,利用大语言模型和区域交叉注意力控制来增强基于扩散变换器的图像生成可控性,从而优化图像的高级内容和低级细节表现。
English: This paper introduces a coarse-to-fine pipeline using large language models and regional cross-attention control to enhance the controllability of diffusion transformer-based image generation, improving both high-level content and low-level details.

Authors:Minhui Xie, Hao Peng, Pu Li, Guangjie Zeng, Shuhai Wang, Jia Wu, Peng Li, Philip S. Yu
Title: Hierarchical Superpixel Segmentation via Structural Information Theory
Abstract:
Superpixel segmentation is a foundation for many higher-level computer vision tasks, such as image segmentation, object recognition, and scene understanding. Existing graph-based superpixel segmentation methods typically concentrate on the relationships between a given pixel and its directly adjacent pixels while overlooking the influence of non-adjacent pixels. These approaches do not fully leverage the global information in the graph, leading to suboptimal segmentation quality. To address this limitation, we present SIT-HSS, a hierarchical superpixel segmentation method based on structural information theory. Specifically, we first design a novel graph construction strategy that incrementally explores the pixel neighborhood to add edges based on 1-dimensional structural entropy (1D SE). This strategy maximizes the retention of graph information while avoiding an overly complex graph structure. Then, we design a new 2D SE-guided hierarchical graph partitioning method, which iteratively merges pixel clusters layer by layer to reduce the graph's 2D SE until a predefined segmentation scale is achieved. Experimental results on three benchmark datasets demonstrate that SIT-HSS performs better than state-of-the-art unsupervised superpixel segmentation algorithms. The source code is available at https://github.com/SELGroup/SIT-HSS.
中文: 提出的SIT-HSS方法通过结构信息理论探索相邻和非相邻像素关系,改进了超像素分割,在性能上超越了现有无监督算法。
English: The proposed SIT-HSS method enhances superpixel segmentation by incorporating structural information theory to explore both adjacent and non-adjacent pixel relationships, achieving superior performance over existing unsupervised algorithms.
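
As background for the 1D SE term used above, here is a minimal sketch of one-dimensional structural entropy as it is commonly defined in structural information theory: the Shannon entropy of the degree distribution normalized by the graph volume. The function name and the toy graph are illustrative assumptions; SIT-HSS's pixel-graph construction and its 2D SE partitioning are not reproduced here.

```python
import math


def one_dim_structural_entropy(edges):
    """1D structural entropy of an undirected weighted graph.

    edges: iterable of (u, v, weight) tuples.
    Returns -sum_i (d_i / vol) * log2(d_i / vol), where d_i is the weighted
    degree of node i and vol is the sum of all weighted degrees.
    """
    degree = {}
    for u, v, weight in edges:
        degree[u] = degree.get(u, 0.0) + weight
        degree[v] = degree.get(v, 0.0) + weight
    volume = sum(degree.values())
    if volume <= 0:
        raise ValueError("graph must have positive total edge weight")
    return -sum(
        (d / volume) * math.log2(d / volume)
        for d in degree.values()
        if d > 0
    )


# Toy example: a 4-cycle with unit weights. Every node has degree 2 out of
# a total volume of 8, so the degree distribution is uniform over 4 nodes.
cycle = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0)]
entropy = one_dim_structural_entropy(cycle)
```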

Authors:Yan Zhang, Haoqi Li, Ramtin Tabatabaei, Wafa Johal
Title: ROSAnnotator: A Web Application for ROSBag Data Analysis in Human-Robot Interaction
Abstract:
Human-robot interaction (HRI) is an interdisciplinary field that utilises both quantitative and qualitative methods. While ROSBags, a file format within the Robot Operating System (ROS), offer an efficient means of collecting temporally synched multimodal data in empirical studies with real robots, there is a lack of tools specifically designed to integrate qualitative coding and analysis functions with ROSBags. To address this gap, we developed ROSAnnotator, a web-based application that incorporates a multimodal Large Language Model (LLM) to support both manual and automated annotation of ROSBag data. ROSAnnotator currently facilitates video, audio, and transcription annotations and provides an open interface for custom ROS messages and tools. By using ROSAnnotator, researchers can streamline the qualitative analysis process, create a more cohesive analysis pipeline, and quickly access statistical summaries of annotations, thereby enhancing the overall efficiency of HRI data analysis. https://github.com/CHRI-Lab/ROSAnnotator
中文摘要:ROSBags虽能高效收集多模态数据,但缺乏定性分析工具,为此我们开发了基于多模态大语言模型的网络应用ROSAnnotator,支持视频、音频和文本标注,可显著提升人机交互数据分析效率。
English Summary: ROSAnnotator is a web-based tool that integrates multimodal LLM capabilities with ROSBags to enable both manual and automated annotation of HRI data, streamlining qualitative analysis and enhancing research efficiency.

Authors:Jianming Tong, Tianhao Huang, Leo de Castro, Anirudh Itagi, Jingtian Dang, Anupam Golder, Asra Ali, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna
Title: Leveraging ASIC AI Chips for Homomorphic Encryption
Abstract:
Cloud-based services are making the outsourcing of sensitive client data increasingly common. Although homomorphic encryption (HE) offers strong privacy guarantees, it requires substantially more resources than computing on plaintext, often leading to unacceptably large latencies in getting the results. HE accelerators have emerged to mitigate this latency issue, but with the high cost of ASICs. In this paper we show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud. Adapting such accelerators for HE requires (1) supporting modular multiplication, (2) high-precision arithmetic in software, and (3) efficient mapping on matrix engines. We introduce the CROSS compiler, which (1) adopts Barrett reduction to provide modular reduction support using multipliers and adders, (2) applies Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, and (3) applies Matrix Aligned Transformation (MAT) to convert vectorized modular operations with reduction into matrix multiplication that can be efficiently processed on a 2D spatial matrix engine. Our evaluation of CROSS on a Google TPUv4 demonstrates significant performance improvements, with up to 161x and 5x speedup compared to the previous work on many-core CPUs and V100, respectively. The kernel-level codes are open-sourced at https://github.com/google/jaxite/tree/main/jaxite_word.
中文: 同态加密可通过将其原语转换为AI算子并在现有AI加速器(如TPU)上运行来加速,通过支持模乘运算和矩阵变换等技术实现了显著的性能提升。
English: Homomorphic encryption can be accelerated by converting its primitives into AI operators and running them on existing AI accelerators like TPUs, achieving significant speedups through techniques such as modular multiplication support and matrix transformations.
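
The Barrett reduction mentioned above replaces division by the modulus with multiplications, shifts, and subtractions. The sketch below shows the textbook algorithm on plain Python integers; the function names and the example modulus are illustrative assumptions, and this is not the CROSS compiler's TPU implementation.

```python
def barrett_setup(modulus):
    """Precompute k and mu = floor(4**k / modulus) for a fixed modulus."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    k = modulus.bit_length()
    mu = (1 << (2 * k)) // modulus
    return k, mu


def barrett_reduce(x, modulus, k, mu):
    """Return x mod modulus for 0 <= x < modulus**2 without dividing by modulus."""
    # Estimate the quotient with a multiply and a shift; the estimate is
    # never too large and falls short of the true quotient only slightly.
    q = (x * mu) >> (2 * k)
    r = x - q * modulus
    # Correct the small underestimate with conditional subtractions.
    while r >= modulus:
        r -= modulus
    return r


def barrett_modmul(a, b, modulus, k, mu):
    """Modular multiplication of two residues a, b in [0, modulus)."""
    return barrett_reduce(a * b, modulus, k, mu)


# Hypothetical example with a small prime modulus.
MODULUS = 65537
K, MU = barrett_setup(MODULUS)
product = barrett_modmul(12345, 54321, MODULUS, K, MU)
```

The precomputed pair (k, mu) depends only on the modulus, so it can be computed once and reused for every multiplication under that modulus.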

Authors:Hoang-Thang Ta, Duy-Quy Thai, Anh Tran, Grigori Sidorov, Alexander Gelbukh
Title: PRKAN: Parameter-Reduced Kolmogorov-Arnold Networks
Abstract:
Kolmogorov-Arnold Networks (KANs) represent an innovation in neural network architectures, offering a compelling alternative to Multi-Layer Perceptrons (MLPs) in models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. By advancing network design, KANs drive groundbreaking research and enable transformative applications across various scientific domains involving neural networks. However, existing KANs often require significantly more parameters in their network layers than MLPs. To address this limitation, this paper introduces PRKANs (Parameter-Reduced Kolmogorov-Arnold Networks), which employ several methods to reduce the parameter count in KAN layers, making them comparable to MLP layers. Experimental results on the MNIST and Fashion-MNIST datasets demonstrate that PRKANs outperform several existing KANs, and their variant with attention mechanisms rivals the performance of MLPs, albeit with slightly longer training times. Furthermore, the study highlights the advantages of Gaussian Radial Basis Functions (GRBFs) and layer normalization in KAN designs. The repository for this work is available at: https://github.com/hoangthangta/All-KAN.
中文: PRKANs通过参数精简技术改进了Kolmogorov-Arnold网络,在基准数据集上实现了与多层感知机相当的性能,同时保留了KANs的创新优势。
English: PRKANs introduce parameter reduction techniques to Kolmogorov-Arnold Networks, achieving performance comparable to MLPs on benchmark datasets while maintaining KANs' innovative advantages.

Authors:Xuhui Guo, Tanmoy Dam, Rohan Dhamdhere, Gourav Modanwal, Anant Madabhushi
Title: UNetVL: Enhancing 3D Medical Image Segmentation with Chebyshev KAN Powered Vision-LSTM
Abstract:
3D medical image segmentation has progressed considerably due to Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), yet these methods struggle to balance long-range dependency acquisition with computational efficiency. To address this challenge, we propose UNETVL (U-Net Vision-LSTM), a novel architecture that leverages recent advancements in temporal information processing. UNETVL incorporates Vision-LSTM (ViL) for improved scalability and memory functions, alongside an efficient Chebyshev Kolmogorov-Arnold Network (KAN) to handle complex and long-range dependency patterns more effectively. We validated our method on the ACDC and AMOS2022 (post challenge Task 2) benchmark datasets, showing a significant improvement in mean Dice score compared to recent state-of-the-art approaches, especially over its predecessor, UNETR, with increases of 7.3% on ACDC and 15.6% on AMOS, respectively. Extensive ablation studies were conducted to demonstrate the impact of each component in UNETVL, providing a comprehensive understanding of its architecture. Our code is available at https://github.com/tgrex6/UNETVL, facilitating further research and applications in this domain.
中文: 提出的UNETVL架构通过融合Vision-LSTM和高效切比雪夫KAN网络,显著提升了三维医学图像分割中长程依赖关系的建模能力,在ACDC和AMOS数据集上的Dice分数分别较现有最优方法提高了7.3%和15.6%。
English: The proposed UNETVL architecture enhances 3D medical image segmentation by integrating Vision-LSTM and an efficient Chebyshev KAN to better capture long-range dependencies, improving mean Dice score over its predecessor UNETR by 7.3% on ACDC and 15.6% on AMOS.
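
The Dice score reported above measures overlap between a predicted and a reference segmentation mask. The following is a minimal sketch for flat binary masks; the function name, the toy masks, and the empty-mask convention are illustrative assumptions, and the per-class, per-volume averaging behind the paper's mean Dice is not described in this entry.

```python
def dice_score(predicted, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary masks."""
    if len(predicted) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(predicted, target) if p and t)
    total = sum(1 for p in predicted if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy example: 2 predicted voxels, 1 reference voxel, 1 voxel in common.
dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```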

Authors:Binyu Zhang, Shichao Li, Junpeng Jian, Zhu Meng, Limei Guo, Zhicheng Zhao
Title: A Multi-Modal Deep Learning Framework for Pan-Cancer Prognosis
Abstract:
Prognosis is of great importance, as it is closely related to the survival analysis of patients, the optimization of treatment plans, and the allocation of resources. Existing prognostic models have shown promising results on specific datasets, but they are limited in two respects. On the one hand, they explore only certain types of modal data, such as patient histopathology WSI and gene expression analysis. On the other hand, they adopt the per-cancer-per-model paradigm, which means the trained models can only predict the prognostic effect of a single type of cancer, resulting in weak generalization ability. In this paper, a deep learning-based model, named UMPSNet, is proposed. Specifically, to comprehensively understand the condition of patients, in addition to constructing encoders for histopathology images and genomic expression profiles respectively, UMPSNet further integrates four types of important metadata (demographic information, cancer type information, treatment protocols, and diagnosis results) into text templates, and then introduces a text encoder to extract textual features. In addition, an optimal transport (OT)-based attention mechanism is utilized to align and fuse features of different modalities. Furthermore, a guided soft mixture of experts (GMoE) mechanism is introduced to effectively address the issue of distribution differences among multiple cancer datasets. By incorporating the multi-modality of patient data and joint training, UMPSNet outperforms all SOTA approaches, and moreover, it demonstrates the effectiveness and generalization ability of the proposed learning paradigm of a single model for multiple cancer types. The code of UMPSNet is available at https://github.com/binging512/UMPSNet.
中文:提出的UMPSNet模型通过整合病理图像、基因组数据和临床文本等多模态信息,采用基于最优运输的注意力机制和引导混合专家方法,在多种癌症类型的预后预测中实现了优于现有单模态单癌症模型的性能与泛化能力。
English: The proposed UMPSNet model integrates multiple data modalities, including histopathology images, genomic profiles, and clinical metadata, using optimal transport-based attention and guided mixture of experts mechanisms to achieve superior prognostic performance and generalization across multiple cancer types compared to existing single-modality, single-cancer models.

Authors:Henry Li, Ronen Basri, Yuval Kluger
Title: Likelihood Training of Cascaded Diffusion Models via Hierarchical Volume-preserving Maps
Abstract:
Cascaded models are multi-scale generative models with a marked capacity for producing perceptually impressive samples at high resolutions. In this work, we show that they can also be excellent likelihood models, so long as we overcome a fundamental difficulty with probabilistic multi-scale models: the intractability of the likelihood function. Chiefly, in cascaded models each intermediary scale introduces extraneous variables that cannot be tractably marginalized out for likelihood evaluation. This issue vanishes by modeling the diffusion process on latent spaces induced by a class of transformations we call hierarchical volume-preserving maps, which decompose spatially structured data in a hierarchical fashion without introducing local distortions in the latent space. We demonstrate that two such maps are well-known in the literature for multiscale modeling: Laplacian pyramids and wavelet transforms. Not only do such reparameterizations allow the likelihood function to be directly expressed as a joint likelihood over the scales, we show that the Laplacian pyramid and wavelet transform also produce significant improvements to the state-of-the-art on a selection of benchmarks in likelihood modeling, including density estimation, lossless compression, and out-of-distribution detection. Investigating the theoretical basis of our empirical gains, we uncover deep connections to score matching under the Earth Mover's Distance (EMD), which is a well-known surrogate for perceptual similarity. Code can be found at https://github.com/lihenryhfl/pcdm.
中文摘要:级联模型通过采用拉普拉斯金字塔和小波变换等分层保体积映射,解决了多尺度似然评估的难处理性问题,从而实现了最先进的似然建模性能。
English Summary: Cascaded models can achieve state-of-the-art likelihood modeling by using hierarchical volume-preserving transformations like Laplacian pyramids and wavelet transforms, which overcome the intractability of multi-scale likelihood evaluation.

Authors:Jimeng Shi, Azam Shirali, Bowen Jin, Sizhe Zhou, Wei Hu, Rahuul Rangaraj, Shaowen Wang, Jiawei Han, Zhaonan Wang, Upmanu Lall, Yanzhao Wu, Leonardo Bobadilla, Giri Narasimhan
Title: Deep Learning and Foundation Models for Weather Prediction: A Survey
Abstract:
Physics-based numerical models have been the bedrock of atmospheric sciences for decades, offering robust solutions but often at the cost of significant computational resources. Deep learning (DL) models have emerged as powerful tools in meteorology, capable of analyzing complex weather and climate data by learning intricate dependencies and providing rapid predictions once trained. While these models demonstrate promising performance in weather prediction, often surpassing traditional physics-based methods, they still face critical challenges. This paper presents a comprehensive survey of recent deep learning and foundation models for weather prediction. We propose a taxonomy to classify existing models based on their training paradigms: deterministic predictive learning, probabilistic generative learning, and pre-training and fine-tuning. For each paradigm, we delve into the underlying model architectures, address major challenges, offer key insights, and propose targeted directions for future research. Furthermore, we explore real-world applications of these methods and provide a curated summary of open-source code repositories and widely used datasets, aiming to bridge research advancements with practical implementations while fostering open and trustworthy scientific practices in adopting cutting-edge artificial intelligence for weather prediction. The related sources are available at https://github.com/JimengShi/DL-Foundation-Models-Weather.
中文摘要:本文综述了用于天气预报的深度学习和基础模型,按训练范式分类并探讨了挑战、应用及开源资源,旨在推动人工智能在气象领域的应用发展。
English Summary: This paper surveys deep learning and foundation models for weather prediction, categorizing them by training paradigms and addressing challenges, applications, and open-source resources to advance AI in meteorology.

Authors:Liyan Chen, Huangying Zhan, Kevin Chen, Xiangyu Xu, Qingan Yan, Changjiang Cai, Yi Xu
Title: ActiveGAMER: Active GAussian Mapping through Efficient Rendering
Abstract:
We introduce ActiveGAMER, an active mapping system that utilizes 3D Gaussian Splatting (3DGS) to achieve high-quality, real-time scene mapping and exploration. Unlike traditional NeRF-based methods, which are computationally demanding and restrict active mapping performance, our approach leverages the efficient rendering capabilities of 3DGS, allowing effective and efficient exploration in complex environments. The core of our system is a rendering-based information gain module that dynamically identifies the most informative viewpoints for next-best-view planning, enhancing both geometric and photometric reconstruction accuracy. ActiveGAMER also integrates a carefully balanced framework, combining coarse-to-fine exploration, post-refinement, and a global-local keyframe selection strategy to maximize reconstruction completeness and fidelity. Our system autonomously explores and reconstructs environments with state-of-the-art geometric and photometric accuracy and completeness, significantly surpassing existing approaches in both aspects. Extensive evaluations on benchmark datasets such as Replica and MP3D highlight ActiveGAMER's effectiveness in active mapping tasks.
中文:ActiveGAMER 是一种利用 3D 高斯泼溅技术的主动建图系统,能够实现实时场景探索与重建,在精度和效率上均优于传统方法。
English: ActiveGAMER is an active mapping system that uses 3D Gaussian Splatting for real-time scene exploration and reconstruction, outperforming traditional methods in accuracy and efficiency.

Authors:Krishna Upadhyay, Vinaik Chhetri, A. B. Siddique, Umar Farooq
Title: Analyzing the Evolution and Maintenance of Quantum Software Repositories
Abstract:
Quantum computing is rapidly advancing, but quantum software development faces significant challenges, including a steep learning curve, high hardware error rates, and a lack of mature engineering practices. This study conducts a large-scale mining analysis of over 21,000 GitHub repositories, containing 1.2 million commits from more than 10,000 developers, to examine the evolution and maintenance of quantum software. We analyze repository growth, programming language and framework adoption, and contributor trends, revealing a 200% increase in repositories and a 150% rise in contributors since 2017. Additionally, we investigate software development and maintenance practices, showing that perfective commits dominate (51.76%), while the low occurrence of corrective commits (18.54%) indicates potential gaps in bug resolution. Furthermore, 34% of reported issues are quantum-specific, highlighting the need for specialized debugging tools beyond conventional software engineering approaches. This study provides empirical insights into the software engineering challenges of quantum computing, offering recommendations to improve development workflows, tooling, and documentation. We are also open-sourcing our dataset to support further analysis by the community and to guide future research and tool development for quantum computing. The dataset is available at: https://github.com/kriss-u/QRepoAnalysis-Paper
中文: 本研究通过分析2.1万个GitHub代码库,揭示了量子软件的快速增长及维护挑战,包括完善性提交占主导地位,以及需要专门调试工具解决的量子特性问题。
English: This study analyzes over 21,000 GitHub repositories to reveal quantum software's rapid growth and maintenance challenges, including a dominance of perfective commits and quantum-specific issues requiring specialized debugging tools.

Authors:Haojun Yu, Di Dai, Ziwei Zhao, Di He, Han Hu, Liwei Wang
Title: LarvSeg: Exploring Image Classification Data For Large Vocabulary Semantic Segmentation via Category-wise Attentive Classifier
Abstract:
Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on newly introduced categories in the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of corresponding categories to improve the model performance. Extensive experiments demonstrate that LarvSeg significantly improves the large vocabulary semantic segmentation performance, especially in the categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code is available at https://github.com/HaojunYu1998/large_voc_seg.
中文: 提出的LarvSeg框架通过利用图像分类数据和类别注意力分类器,显著提升大词汇量语义分割性能,无需大量掩码标注即可扩展至21K个类别。
English: The proposed LarvSeg framework enhances large vocabulary semantic segmentation by utilizing image classification data and a category-wise attentive classifier, effectively scaling to 21K categories without requiring extensive mask annotations.

Authors:Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Title: A General Framework for Inference-time Scaling and Steering of Diffusion Models
Abstract:
Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models - even with off-the-shelf rewards - can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering.
中文: 本文提出Feynman-Kac引导方法,这是一种无需训练即可通过奖励函数在推理时指导扩散模型的框架,在文本到图像生成和文本扩散任务中,其效果优于经过微调的模型,显著提升了样本质量和可控性。
English: This paper introduces Feynman-Kac (FK) steering, an inference-time framework that guides diffusion models using reward functions to enhance sample quality and controllability without requiring training, achieving superior results in text-to-image generation and text diffusion tasks compared to fine-tuned models.
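
The propagate-then-resample loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper's code: the "diffusion" step is a 1D random walk rather than a real denoiser, the potential is the simple choice exp(lam * reward), and the names `fk_steer`, `random_walk_step`, and `reward_near_two` are hypothetical.

```python
import math
import random


def fk_steer(step_fn, reward_fn, num_particles=200, num_steps=20, lam=5.0, seed=0):
    """Toy particle steering: propagate, weight by a potential, resample."""
    rng = random.Random(seed)
    # Start every particle from a standard normal "noise" sample.
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]
    for t in range(num_steps):
        # 1) Propagate each particle with the (toy) sampler step.
        particles = [step_fn(x, t, rng) for x in particles]
        # 2) Score intermediate states; the potential is exp(lam * reward).
        #    Subtracting the max reward keeps the exponentials bounded.
        rewards = [reward_fn(x) for x in particles]
        best = max(rewards)
        weights = [math.exp(lam * (r - best)) for r in rewards]
        # 3) Resample with replacement in proportion to the potentials.
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return particles


def random_walk_step(x, t, rng):
    # Hypothetical stand-in for one reverse-diffusion step.
    return x + rng.gauss(0.0, 0.3)


def reward_near_two(x):
    # Hypothetical reward: highest when the sample is close to 2.0.
    return -(x - 2.0) ** 2


def mean(values):
    return sum(values) / len(values)


steered = fk_steer(random_walk_step, reward_near_two, lam=5.0)
unsteered = fk_steer(random_walk_step, reward_near_two, lam=0.0)
print("steered mean:", mean(steered))
print("unsteered mean:", mean(unsteered))
```

With `lam=0.0` all potentials are equal, so resampling is uniform and the particles are not pulled toward the reward; raising `lam` concentrates them on high-reward states, without any gradient of the reward and without training.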

Authors:Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
Title: SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
中文: 本文提出SPAM优化器,通过动量重置和梯度裁剪有效缓解大语言模型训练中的梯度尖峰问题,在多种任务中提升训练稳定性与资源效率,性能优于现有方法。
English: This paper introduces SPAM, a novel optimizer that mitigates gradient spikes in LLM training through momentum reset and spike-aware gradient clipping, improving stability and efficiency across various tasks and outperforming existing methods.
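
The two mechanisms named in the abstract, momentum reset and spike-aware gradient clipping, can be sketched on top of a plain Adam update. This is a simplified illustration and not the authors' implementation: the abstract does not give the clipping rule, so the sketch assumes that a gradient counts as a spike when its square exceeds `spike_threshold` times the running second moment, and clips it to that bound. The class name `SpamLikeAdam` and all default values are hypothetical.

```python
import math


class SpamLikeAdam:
    """Adam on a list of floats with periodic momentum reset and spike clipping."""

    def __init__(self, n, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                 spike_threshold=5000.0, reset_interval=500):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.spike_threshold = spike_threshold  # assumed rule: g^2 > threshold * v
        self.reset_interval = reset_interval    # assumed: reset every N steps
        self.m = [0.0] * n                      # first-moment estimates
        self.v = [0.0] * n                      # second-moment estimates
        self.steps_since_reset = 0

    def clip_spikes(self, grads):
        # Clip any gradient whose square is far above its running second moment.
        clipped = []
        for g, v in zip(grads, self.v):
            limit = self.spike_threshold * v
            if v > 0.0 and g * g > limit:
                g = math.copysign(math.sqrt(limit), g)
            clipped.append(g)
        return clipped

    def step(self, params, grads):
        # Momentum reset: discard both moments once the interval has elapsed.
        if self.reset_interval and self.steps_since_reset >= self.reset_interval:
            self.m = [0.0] * len(self.m)
            self.v = [0.0] * len(self.v)
            self.steps_since_reset = 0
        grads = self.clip_spikes(grads)
        self.steps_since_reset += 1
        t = self.steps_since_reset
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            # Bias correction restarts from the most recent reset.
            m_hat = self.m[i] / (1 - self.beta1 ** t)
            v_hat = self.v[i] / (1 - self.beta2 ** t)
            new_params.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params


opt = SpamLikeAdam(n=1, lr=0.1)
params = opt.step([0.0], [1.0])
print(params)
```

Resetting the moments limits how long a spike that slips through can contaminate later updates, while the clipping rule bounds each gradient relative to its own history instead of using a single global norm.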

Authors:Du Chen, Liyi Chen, Zhengqiang Zhang, Lei Zhang
Title: Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
Abstract:
Implicit Neural Representations (INR) have been successfully employed for Arbitrary-scale Super-Resolution (ASR). However, INR-based models need to query the multi-layer perceptron module numerous times and render a pixel in each query, resulting in insufficient representation capability and low computational efficiency. Recently, Gaussian Splatting (GS) has shown its advantages over INR in both visual quality and rendering speed in 3D tasks, which motivates us to explore whether GS can be employed for the ASR task. However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method through overfitting each single scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Each Gaussian can fit the shape and direction of an area of complex textures, showing powerful representation capability. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians. Via end-to-end training, our optimized network, namely GSASR, can perform ASR for any image and unseen scaling factors. Extensive experiments validate the effectiveness of our proposed method. The code and models are available at https://github.com/ChrisDud0257/GSASR.
中文摘要:本文提出GSASR方法,通过设计图像条件化的高斯预测架构和高效的尺度感知渲染技术,将高斯溅射成功应用于任意尺度超分辨率任务,解决了泛化性难题。
English Summary: The paper introduces GSASR, a novel method that adapts Gaussian Splatting for Arbitrary-scale Super-Resolution by designing an image-conditioned Gaussian prediction architecture and an efficient scale-aware rasterization technique to overcome generalization challenges.

Authors:Minglong Xue, Shuaibin Fan, Shivakumara Palaiahnakote, Mingliang Zhou
Title: UR2P-Dehaze: Learning a Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior
Abstract:
Image dehazing techniques aim to enhance contrast and restore details, which are essential for preserving visual information and improving image processing accuracy. Existing methods rely on a single manual prior, which cannot effectively reveal image details. To overcome this limitation, we propose an unpaired image dehazing network, called the Simple Image Dehaze Enhancer via Unpaired Rich Physical Prior (UR2P-Dehaze). First, to accurately estimate the illumination, reflectance, and color information of the hazy image, we design a shared prior estimator (SPE) that is iteratively trained to ensure the consistency of illumination and reflectance, generating clear, high-quality images. Additionally, a self-monitoring mechanism is introduced to eliminate undesirable features, providing reliable priors for image reconstruction. Next, we propose Dynamic Wavelet Separable Convolution (DWSC), which effectively integrates key features across both low and high frequencies, significantly enhancing the preservation of image details and ensuring global consistency. Finally, to effectively restore the color information of the image, we propose an Adaptive Color Corrector that addresses the problem of unclear colors. The PSNR, SSIM, LPIPS, FID and CIEDE2000 metrics on the benchmark dataset show that our method achieves state-of-the-art performance. It also contributes to the performance improvement of downstream tasks. The project code will be available at https://github.com/Fan-pixel/UR2P-Dehaze.
中文摘要:UR2P-Dehaze网络通过共享先验估计器和动态小波卷积整合非配对丰富物理先验,有效克服单先验方法的局限,在去雾性能、细节保留和色彩恢复方面达到最优水平。
English Summary: The proposed UR2P-Dehaze network overcomes limitations of single-prior methods by integrating unpaired rich physical priors through shared estimation and dynamic wavelet convolution, achieving state-of-the-art dehazing performance with enhanced detail preservation and color restoration.

Authors:Keyan Chen, Jiafan Zhang, Chenyang Liu, Zhengxia Zou, Zhenwei Shi
Title: RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models
Abstract:
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding through free-format textual input, enabling enhanced scene and object extraction in remote sensing applications. Current research primarily utilizes pre-trained language models to encode textual descriptions and align them with visual modalities, thereby facilitating the expression of relevant visual features. However, these approaches often struggle to establish robust alignments between fine-grained semantic concepts, leading to inconsistent representations across textual and visual information. To address these limitations, we introduce a referring remote sensing image segmentation foundational model, RSRefSeg. RSRefSeg leverages CLIP for visual and textual encoding, employing both global and local textual semantics as filters to generate referring-related visual activation features in the latent space. These activated features then serve as input prompts for SAM, which refines the segmentation masks through its robust visual generalization capabilities. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods, underscoring the effectiveness of foundational models in enhancing multimodal task comprehension. The code is available at https://github.com/KyanChen/RSRefSeg.
中文摘要:RSRefSeg模型通过结合CLIP的多模态编码与SAM的分割优化能力,显著提升了遥感图像指代分割的精度,在基准测试中凭借更精准的图文特征对齐优于现有方法。
English Summary: The RSRefSeg model enhances referring remote sensing image segmentation by integrating CLIP's multimodal encoding with SAM's segmentation refinement, achieving superior performance on benchmark datasets through improved alignment of textual and visual features.

Authors:Mahmoud Ahmed, Xiang Li, Arpit Prajapati, Mohamed Elhoseiny
Title: 3DCoMPaT200: Language-Grounded Compositional Understanding of Parts and Materials of 3D Shapes
Abstract:
Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding encompass a limited range of categories. For instance, the ShapeNet-Part and PartNet datasets include only 16 and 24 object categories, respectively. The 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer and fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials, with 200 object categories, an object vocabulary approximately 5 times larger than that of 3DCoMPaT, and approximately 4 times as many part categories. Concretely, 3DCoMPaT200 significantly expands upon 3DCoMPaT, featuring 1,031 fine-grained part categories and 293 distinct material classes for compositional application to 3D object parts. Additionally, to address the complexities of compositional 3D modeling, we propose a novel task of Compositional Part Shape Retrieval using ULIP to provide a strong 3D foundational model for 3D Compositional Understanding. This method evaluates the model's shape retrieval performance given one, three, or six parts described in text format. These results show that the model's performance improves with an increasing number of style compositions, highlighting the critical role of the compositional dataset. Such results underscore the dataset's effectiveness in enhancing models' capability to understand complex 3D shapes from a compositional perspective. Code and Data can be found at http://github.com/3DCoMPaT200/3DCoMPaT200
中文: 为推进部件级三维物体理解,研究者提出了大规模数据集3DCoMPaT200,涵盖200个物体类别并显著扩展了部件与材料分类,同时通过ULIP框架创新性地引入组合式部件检索任务,证明模型性能随组合复杂度提升而增强。
English: To advance part-level 3D object understanding, the authors introduce 3DCoMPaT200, a large-scale dataset with 200 object categories, significantly expanding part and material classes, and propose a compositional part shape retrieval task using ULIP to enhance model performance with increasing compositional complexity.

Authors:Shaw Walters, Sam Gao, Shakker Nerd, Feng Da, Warren Williams, Ting-Chien Meng, Amie Chow, Hunter Han, Frank He, Allen Zhang, Ming Wu, Timothy Shen, Maxwell Hu, Jerry Yan
Title: Eliza: A Web3 friendly AI Agent Operating System
Abstract:
An AI agent, powered by large language models (LLMs) as its cognitive core, is an intelligent agentic system capable of autonomously controlling and determining execution paths under a user's instructions. With the burst of capabilities of LLMs and various plugins, such as RAG, text-to-image/video/3D, etc., the potential of AI agents has been vastly expanded, with their capabilities growing stronger by the day. However, at the intersection between AI and web3, there is currently no ideal agentic framework that can seamlessly integrate web3 applications into AI agent functionalities. In this paper, we propose Eliza, the first open-source web3-friendly agentic framework that makes the deployment of web3 applications effortless. We emphasize that every aspect of Eliza is a regular TypeScript program under the full control of its user, and it seamlessly integrates with web3 (i.e., reading and writing blockchain data, interacting with smart contracts, etc.). Furthermore, we show how stable performance is achieved through the pragmatic implementation of the key components of Eliza's runtime. Our code is publicly available at https://github.com/ai16z/eliza.
中文: Eliza是首个开源的、兼容web3的智能体框架,它通过用户完全控制的TypeScript程序,实现了AI代理与web3应用的无缝集成,支持区块链数据读写和智能合约交互。
English: Eliza is the first open-source, web3-friendly agentic framework that enables seamless integration of web3 applications into AI agents, allowing autonomous control and blockchain interactions through user-directed TypeScript programs.

Authors:Ji Soo Lee, Jongha Kim, Jeehye Na, Jinyoung Park, Hyunwoo J. Kim
Title: VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning
Abstract:
Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision to each task that is well-aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at https://github.com/mlvlab/VidChain.
中文摘要:提出的VidChain框架通过将复杂任务分解为子任务并将模型训练与评估指标对齐,增强了视频大语言模型的细粒度时序理解能力,在密集视频描述基准测试中实现了更优性能。
English Summary: The proposed VidChain framework enhances VideoLLMs' fine-grained temporal understanding by decomposing complex tasks into sub-tasks and aligning model training with evaluation metrics, achieving superior performance on Dense Video Captioning benchmarks.

Authors:Tianyu Fan, Jingyuan Wang, Xubin Ren, Chao Huang
Title: MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation
Abstract:
The growing demand for efficient and lightweight Retrieval-Augmented Generation (RAG) systems has highlighted significant challenges when deploying Small Language Models (SLMs) in existing RAG frameworks. Current approaches face severe performance degradation due to SLMs' limited semantic understanding and text processing capabilities, creating barriers for widespread adoption in resource-constrained scenarios. To address these fundamental limitations, we present MiniRAG, a novel RAG system designed for extreme simplicity and efficiency. MiniRAG introduces two key technical innovations: (1) a semantic-aware heterogeneous graph indexing mechanism that combines text chunks and named entities in a unified structure, reducing reliance on complex semantic understanding, and (2) a lightweight topology-enhanced retrieval approach that leverages graph structures for efficient knowledge discovery without requiring advanced language capabilities. Our extensive experiments demonstrate that MiniRAG achieves comparable performance to LLM-based methods even when using SLMs while requiring only 25\% of the storage space. Additionally, we contribute a comprehensive benchmark dataset for evaluating lightweight RAG systems under realistic on-device scenarios with complex queries. We fully open-source our implementation and datasets at: https://github.com/HKUDS/MiniRAG.
中文:MiniRAG是一种新型轻量级RAG系统,通过语义感知图索引和拓扑增强检索技术克服了小语言模型的性能限制,仅需25%存储空间即可实现与基于大语言模型方法相当的性能。
English: MiniRAG is a novel lightweight RAG system that overcomes Small Language Models' limitations through semantic-aware graph indexing and topology-enhanced retrieval, achieving comparable performance to LLM-based methods with only 25% storage space.
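
The idea of indexing text chunks and named entities in one graph, then retrieving by graph structure rather than by deep semantic matching, can be illustrated with a small sketch. It is not MiniRAG's actual implementation: entity detection is reduced to substring matching against a hand-written lexicon, the scoring weights (1.0 for a direct entity hit, 0.5 for a one-hop neighbor) are arbitrary, and `build_index` and `retrieve` are hypothetical names.

```python
from collections import defaultdict


def build_index(chunks, entity_lexicon):
    """Build a bipartite chunk-entity graph as two adjacency maps."""
    entity_to_chunks = defaultdict(set)
    chunk_to_entities = defaultdict(set)
    for chunk_id, text in enumerate(chunks):
        lowered = text.lower()
        for entity in entity_lexicon:
            # Assumption: an entity "occurs" if its name is a substring.
            if entity.lower() in lowered:
                entity_to_chunks[entity].add(chunk_id)
                chunk_to_entities[chunk_id].add(entity)
    return entity_to_chunks, chunk_to_entities


def retrieve(query, entity_lexicon, entity_to_chunks, chunk_to_entities, top_k=2):
    """Rank chunks by graph proximity to the entities mentioned in the query."""
    lowered = query.lower()
    seeds = {e for e in entity_lexicon if e.lower() in lowered}
    scores = defaultdict(float)
    for entity in seeds:
        for chunk_id in entity_to_chunks.get(entity, ()):
            scores[chunk_id] += 1.0  # chunk mentions a query entity directly
            # One-hop expansion: other chunks that share a co-occurring entity.
            for neighbor in chunk_to_entities[chunk_id]:
                if neighbor in seeds:
                    continue
                for other_id in entity_to_chunks[neighbor]:
                    if other_id != chunk_id:
                        scores[other_id] += 0.5
    # Sort by descending score, breaking ties by chunk id for determinism.
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ranked[:top_k]


chunks = [
    "Alice founded Acme in Berlin.",
    "Acme builds rockets.",
    "Bob likes tea.",
]
lexicon = ["Alice", "Acme", "Berlin", "Bob"]
entity_to_chunks, chunk_to_entities = build_index(chunks, lexicon)
print(retrieve("Where did Alice work?", lexicon, entity_to_chunks, chunk_to_entities))
```

In this example the query mentions only "Alice", yet the chunk about Acme is also returned because it is one hop away through a shared entity; the retrieval step needs no language model at all, which is the property the abstract attributes to topology-based retrieval.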

Authors:Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang
Title: Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints
Abstract:
Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to extract robust multimodal representations. However, ambiguity between referring expression comprehension (REC) and referring image segmentation (RIS) is error-prone, leading to inconsistencies between multi-task predictions. Besides, insufficient multimodal understanding directly contributes to biased target perception. To overcome these challenges, we propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture (C³VG), which integrates implicit and explicit modeling approaches within a two-stage framework. Initially, query and pixel decoders are employed to generate preliminary detection and segmentation outputs, a process referred to as the Rough Semantic Perception (RSP) stage. These coarse predictions are subsequently refined through the proposed Mask-guided Interaction Module (MIM) and a novel explicit bidirectional consistency constraint loss to ensure consistent representations across tasks, which we term the Refined Consistency Interaction (RCI) stage. Furthermore, to address the challenge of insufficient multimodal understanding, we leverage pre-trained models based on visual-linguistic fusion representations. Empirical evaluations on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate the efficacy and soundness of C³VG, which outperforms state-of-the-art REC and RIS methods by a substantial margin. Code and model will be available at https://github.com/Dmmm1997/C3VG.
中文: 本文提出C³VG架构,通过粗到精的一致性约束,结合隐式和显式建模解决多任务视觉定位中的歧义问题,在基准数据集上显著优于现有最优方法。
English: The paper introduces C³VG, a coarse-to-fine architecture with consistency constraints that enhances multi-task visual grounding by integrating implicit and explicit modeling to resolve ambiguities between localization and segmentation, achieving state-of-the-art performance on benchmark datasets.

Authors:Veronika Smilga
Title: Scaling Down Semantic Leakage: Investigating Associative Bias in Smaller Language Models
Abstract:
Semantic leakage is a phenomenon recently introduced by Gonen et al. (2024). It refers to a situation in which associations learnt from the training data emerge in language model generations in an unexpected and sometimes undesired way. Prior work has focused on leakage in large language models (7B+ parameters). In this study, I use the Qwen2.5 model family to explore whether smaller models, ranging from 500M to 7B parameters, demonstrate less semantic leakage due to their limited capacity for capturing complex associations. Building on the previous dataset from Gonen et al. (2024), I introduce a new dataset of color-focused prompts, categorized into specific types of semantic associations, to systematically evaluate the models' performance. Results indicate that smaller models exhibit less semantic leakage overall, although this trend is not strictly linear, with medium-sized models sometimes surpassing larger ones in leaking behavior. The dataset, the model generations, and the evaluation code are publicly available at https://github.com/smilni/semantic_leakage_project.
中文: 本研究探讨了较小规模Qwen2.5模型(5亿至70亿参数)中的语义泄露现象,发现模型容量减小通常与较少泄露相关,但这种关系并非线性,中等规模模型有时反而表现出比大模型更明显的泄露行为。
English: This study investigates semantic leakage in smaller Qwen2.5 models (500M-7B parameters), revealing that reduced model capacity generally correlates with less leakage, though the relationship is non-linear and medium-sized models occasionally exhibit more leakage than larger ones.

Authors:Xuanle Zhao, Xianzhen Luo, Qi Shi, Chi Chen, Shuo Wang, Zhiyuan Liu, Maosong Sun
Title: ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding tasks. However, interpreting charts with textual descriptions often leads to information loss, as it fails to fully capture the dense information embedded in charts. In contrast, parsing charts into code provides lossless representations that can effectively contain all critical details. Although existing open-source MLLMs have achieved success in chart understanding tasks, they still face two major challenges when applied to chart-to-code tasks: (1) low executability and poor restoration of chart details in the generated code and (2) lack of large-scale and diverse training data. To address these challenges, we propose ChartCoder, the first dedicated chart-to-code MLLM, which leverages Code LLMs as the language backbone to enhance the executability of the generated code. Furthermore, we introduce Chart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation, and propose the Snippet-of-Thought (SoT) method, which transforms direct chart-to-code generation data into step-by-step generation. Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks, achieving superior chart restoration and code executability. Our code is available at https://github.com/thunlp/ChartCoder.
中文:ChartCoder是首个专用于图表转代码的多模态大语言模型,通过采用代码大模型作为语言主干并结合大规模数据集及分步生成方法,有效解决了代码可执行性和数据不足的挑战,仅用70亿参数即在图表还原和代码执行方面超越现有开源模型。
English: ChartCoder is the first dedicated multimodal large language model for chart-to-code conversion, addressing executability and data scarcity issues by using Code LLMs as its backbone and introducing a large-scale dataset with step-by-step generation methods, achieving superior performance with only 7B parameters.

Authors:Narges Rashvand, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Shanle Yao, Hamed Tabkhi
Title: Exploring Pose-Based Anomaly Detection for Retail Security: A Real-World Shoplifting Dataset and Benchmark
Abstract:
Shoplifting poses a significant challenge for retailers, resulting in billions of dollars in annual losses. Traditional security measures often fall short, highlighting the need for intelligent solutions capable of detecting shoplifting behaviors in real time. This paper frames shoplifting detection as an anomaly detection problem, focusing on the identification of deviations from typical shopping patterns. We introduce PoseLift, a privacy-preserving dataset specifically designed for shoplifting detection, addressing challenges such as data scarcity, privacy concerns, and model biases. PoseLift is built in collaboration with a retail store and contains anonymized human pose data from real-world scenarios. By preserving essential behavioral information while anonymizing identities, PoseLift balances privacy and utility. We benchmark state-of-the-art pose-based anomaly detection models on this dataset, evaluating performance using a comprehensive set of metrics. Our results demonstrate that pose-based approaches achieve high detection accuracy while effectively addressing privacy and bias concerns inherent in traditional methods. As one of the first datasets capturing real-world shoplifting behaviors, PoseLift offers researchers a valuable tool to advance computer vision ethically and will be publicly available to foster innovation and collaboration. The dataset is available at https://github.com/TeCSAR-UNCC/PoseLift.
中文: 本文提出PoseLift这一隐私保护数据集,通过匿名化人体姿态数据将商店盗窃行为作为异常检测问题处理,在解决隐私和偏见问题的同时实现了高检测准确率。
English: This paper introduces PoseLift, a privacy-preserving dataset using anonymized human pose data to detect shoplifting as an anomaly detection problem, achieving high accuracy while addressing privacy and bias concerns.

Authors:Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein
Title: ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
Abstract:
Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent
Chinese: ChemAgent是一种创新框架,通过动态自更新的任务分解库和记忆检索机制提升大语言模型在化学推理中的表现,在多个数据集上实现高达46%的性能提升,展现出在药物发现等领域的应用潜力。
English: ChemAgent is a novel framework that enhances large language models' chemical reasoning by using a dynamic, self-updating library for task decomposition and memory retrieval, achieving up to 46% performance gains on datasets and showing promise for applications like drug discovery.

Authors:Tomohiko Nakamura, Kwanghee Choi, Keigo Hojo, Yoshiaki Bando, Satoru Fukayama, Shinji Watanabe
Title: Discrete Speech Unit Extraction via Independent Component Analysis
Abstract:
Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations for speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, even with the high dimensionality and redundancy of S3M representations, preprocessing S3M representations for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as orthogonality or interpretability of individual components of ICA.
Chinese: 本文研究了标准化、主成分分析、白化和独立成分分析等线性预处理方法,用于优化自监督语音模型表示的聚类以生成离散语音单元,并验证了它们在提升自动语音识别性能方面的有效性。
English: This paper explores linear preprocessing methods like standardization, PCA, whitening, and ICA to enhance the clustering of self-supervised speech model representations for discrete speech units, demonstrating their effectiveness in improving automatic speech recognition performance.
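
The pipeline described above (linear preprocessing followed by k-means) can be sketched roughly as follows. This is a toy in pure Python, not the authors' code: only standardization is shown, and the feature values, the function names `standardize` and `kmeans_units`, and the deterministic initialization are all assumptions made for illustration.

```python
import math

def standardize(frames):
    """Scale each feature dimension to zero mean and unit variance."""
    n, d = len(frames), len(frames[0])
    means = [sum(f[j] for f in frames) / n for j in range(d)]
    # Guard against constant dimensions: fall back to a scale of 1.0.
    stds = [math.sqrt(sum((f[j] - means[j]) ** 2 for f in frames) / n) or 1.0
            for j in range(d)]
    return [[(f[j] - means[j]) / stds[j] for j in range(d)] for f in frames]

def kmeans_units(frames, k, iters=10):
    """Plain Lloyd's k-means; returns one cluster index (unit) per frame."""
    # Assumption: deterministic init from the first k frames, for reproducibility.
    centroids = [list(frames[i]) for i in range(k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, f in enumerate(frames):
            dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-D "frame features" with very different scales per dimension.
frames = [[0.0, 100.0], [0.2, 110.0], [5.0, 900.0], [5.2, 890.0]]
units = kmeans_units(standardize(frames), k=2)
print(units)  # one discrete unit index per frame
```

In a real system the frames would be high-dimensional S3M representations, and PCA, whitening, or ICA could replace `standardize` as the preprocessing step.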

Authors:Xianwei Zhuang, Zhihong Zhu, Yuxin Xie, Liming Liang, Yuexian Zou
Title: VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification
Abstract:
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which significantly impedes their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods may require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose an efficient plug-and-play decoding algorithm via Visual-Aware Sparsification (VASparse) from the perspective of token sparsity for mitigating VH. VASparse is inspired by two empirical observations: (1) attention in LVLMs is sparsely activated, and (2) visual-agnostic token sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while preserving visual context effectively. Additionally, we introduce a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead associated with secondary decoding. Subsequently, VASparse recalibrates attention scores to penalize attention sinking of LVLMs towards text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. VASparse achieves state-of-the-art performance for mitigating VH while maintaining competitive decoding speed. Code is available at https://github.com/mengchuang123/VASparse-github.
中文: VASparse是一种高效的即插即用解码算法,通过视觉感知的令牌稀疏化和对比解码来减少大型视觉语言模型中的视觉幻觉,在不影响推理速度的情况下实现了最先进的性能。
English: VASparse is an efficient plug-and-play decoding algorithm that mitigates visual hallucinations in large vision-language models by implementing visual-aware token sparsification and contrastive decoding, achieving state-of-the-art performance without compromising inference speed.

Authors:Yiheng Li, Yang Yang, Zhen Lei
Title: CoreNet: Conflict Resolution Network for Point-Pixel Misalignment and Sub-Task Suppression of 3D LiDAR-Camera Object Detection
Abstract:
Fusing multi-modality inputs from different sensors is an effective way to improve the performance of 3D object detection. However, current methods overlook two important conflicts: point-pixel misalignment and sub-task suppression. The former means a pixel feature from an opaque object is projected to multiple point features of the same ray in the world space, and the latter means the classification prediction and bounding box regression may cause mutual suppression. In this paper, we propose a novel method named Conflict Resolution Network (CoreNet) to address the aforementioned issues. Specifically, we first propose a dual-stream transformation module to tackle point-pixel misalignment. It consists of ray-based and point-based 2D-to-BEV transformations. Both of them achieve approximately unique mapping from the image space to the world space. Moreover, we introduce a task-specific predictor to tackle sub-task suppression. It uses a dual-branch structure that assigns a class-specific query and a Bbox-specific query to the corresponding sub-tasks. Each task-specific query is constructed from a task-specific feature and a general feature, which allows the heads to adaptively select information of interest based on different sub-tasks. Experiments on the large-scale nuScenes dataset demonstrate the superiority of our proposed CoreNet, which achieves 75.6% NDS and 73.3% mAP on the nuScenes test set without test-time augmentation or model ensemble techniques. The ample ablation study also demonstrates the effectiveness of each component. The code is released at https://github.com/liyih/CoreNet.
Chinese: 提出的冲突解决网络(CoreNet)通过双流变换模块和任务特定预测器,有效解决了多模态3D物体检测中的点像素不对齐和子任务抑制问题,在nuScenes数据集上取得了最优性能。
English: The proposed Conflict Resolution Network (CoreNet) effectively addresses point-pixel misalignment and sub-task suppression in multi-modality 3D object detection through dual-stream transformation and task-specific predictors, achieving state-of-the-art performance on the nuScenes dataset.

Authors:Tushar Aggarwal, Aarohi Bhand
Title: PASS: Presentation Automation for Slide Generation and Speech
Abstract:
In today's fast-paced world, effective presentations have become an essential tool for communication in both online and offline meetings. The crafting of a compelling presentation requires significant time and effort, from gathering key insights to designing slides that convey information clearly and concisely. However, despite the wealth of resources available, people often find themselves manually extracting crucial points, analyzing data, and organizing content in a way that ensures clarity and impact. Furthermore, a successful presentation goes beyond just the slides; it demands rehearsal and the ability to weave a captivating narrative to fully engage the audience. Although there has been some exploration of automating document-to-slide generation, existing research is largely centered on converting research papers. In addition, automation of the delivery of these presentations has yet to be addressed. We introduce PASS, a pipeline used to generate slides from general Word documents, going beyond just research papers, which also automates the oral delivery of the generated slides. PASS analyzes user documents to create a dynamic, engaging presentation with an AI-generated voice. Additionally, we developed an LLM-based evaluation metric to assess our pipeline across three critical dimensions of presentations: relevance, coherence, and redundancy. The data and codes are available at https://github.com/AggarwalTushar/PASS.
中文摘要:PASS是一种创新流程,能从通用Word文档自动生成演示文稿并实现AI语音播报,同时采用基于大语言模型的评估指标来衡量内容的相关性、连贯性和冗余度。
English Summary: PASS is an innovative pipeline that automates the creation and delivery of presentations from general Word documents, utilizing AI-generated voice and an LLM-based metric to evaluate relevance, coherence, and redundancy.

Authors:Rui Liu, Zhenqi Jia, Feilong Bao, Haizhou Li
Title: Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis
Abstract:
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. Note that this knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) To effectively retrieve dialogues from SD that are similar to CD in terms of both semantics and style, we first build a stored dialogue semantic-style database (SDSSD) which includes the text and audio samples. Then, we design a multi-attribute retrieval scheme to match the dialogue semantic and style vectors of the CD with the stored dialogue semantic and style vectors in the SDSSD, retrieving the most similar dialogues. 2) To effectively utilize the style knowledge from CD and SD, we propose adopting the multi-granularity graph structure to encode the dialogue and introducing a multi-source style knowledge aggregation mechanism. 3) Finally, the aggregated style knowledge is fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted a comprehensive and in-depth experiment based on the DailyTalk dataset, which is a benchmarking dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: https://github.com/Coder-jzq/RADKA-CSS.
Chinese: 提出的RADKA-CSS框架通过检索和聚合存储对话中的相关风格知识,显著提升了对话语音合成的表现力,优于现有基准模型。
English: The proposed RADKA-CSS framework enhances conversational speech synthesis by retrieving and aggregating relevant style knowledge from stored dialogues, significantly improving expressiveness over previous methods.

Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew C Yao
Title: Tensor Product Attention Is All You Need
Abstract:
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor Product Attention Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
中文: 本文提出张量积注意力(TPA),这是一种通过张量分解压缩键值缓存的新机制,能在语言任务中减少内存占用,同时保持或提升模型性能。
English: The paper introduces Tensor Product Attention (TPA), a novel mechanism that compresses key-value caches using tensor decompositions to reduce memory usage while maintaining or improving model performance in language tasks.
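
To give a feel for why a low-rank factorization shrinks the cache, the sketch below represents one token's key matrix (heads x head dimension) as a sum of a few outer products and compares how many numbers must be stored per token. It is an illustration only, not the paper's formulation: the sizes, the rank, the helper names `outer` and `reconstruct`, and the absence of any scaling or RoPE handling are assumptions.

```python
def outer(a, b):
    """Outer product of two vectors as a list of rows."""
    return [[x * y for y in b] for x in a]

def reconstruct(factors):
    """Sum of rank-1 terms a_r b_r^T, giving a (heads x head_dim) matrix."""
    heads, dim = len(factors[0][0]), len(factors[0][1])
    out = [[0.0] * dim for _ in range(heads)]
    for a, b in factors:
        for i, row in enumerate(outer(a, b)):
            for j, v in enumerate(row):
                out[i][j] += v
    return out

# Hypothetical sizes, chosen only to make the arithmetic concrete.
heads, head_dim, rank = 8, 64, 2
full_cache = heads * head_dim               # numbers per token if the full matrix is cached
factored_cache = rank * (heads + head_dim)  # numbers per token if only the factors are cached
print(full_cache, factored_cache)

# Tiny check that the factors really rebuild a matrix.
small = reconstruct([([1.0, 2.0], [3.0, 4.0])])
print(small)
```

Under these assumed sizes, caching the factors stores 144 numbers per token instead of 512; the actual savings depend on the ranks and shapes a given model uses.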

Authors:Jerry Chee, Arturs Backurs, Rainie Heck, Li Zhang, Janardhan Kulkarni, Thomas Rothvoss, Sivakanth Gopi
Title: DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory
Abstract:
Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of discrepancy theory, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\mathrm{poly}(1/ε)$ samples from the data distribution, we can round all but $O(m)$ model weights such that the expected approximation error of the quantized model on the true data distribution is $\le ε$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our proof, which is algorithmic, inspired a simple and practical rounding algorithm called DiscQuant. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64% accuracy on the GSM8k dataset, whereas GPTQ achieves 54% and RTN achieves 31% (the original model achieves 84%). We make our code available at https://github.com/jerry-chee/DiscQuant.
中文摘要:本文提出DiscQuant方法,基于差异理论通过数据依赖的舍入优化神经网络权重量化,在多个基准测试中显著超越了GPTQ和舍入至最近等现有技术。
English Summary: This paper introduces DiscQuant, a data-dependent rounding method that leverages discrepancy theory to optimize neural network weight quantization, significantly outperforming prior techniques like GPTQ and RTN across multiple benchmarks.
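
For reference, the Round-to-Nearest baseline mentioned in the abstract can be sketched in a few lines. This shows only the data-independent baseline, not DiscQuant's data-dependent rounding; the grid values, the weights, and the function name `round_to_nearest` are hypothetical.

```python
def round_to_nearest(weights, grid):
    """Map each weight to the closest value in the quantization grid."""
    return [min(grid, key=lambda g: abs(g - w)) for w in weights]

# Hypothetical uniform grid and weights, for illustration only.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
weights = [0.26, -0.74, 0.9, 0.1]
quantized = round_to_nearest(weights, grid)
print(quantized)  # [0.5, -0.5, 1.0, 0.0]
```

RTN treats every weight independently, which is what a data-dependent method can improve on: it may round some weights to the farther grid point when that lowers the model's error on sample data.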

Authors:José Ramón Pareja Monturiol, Alejandro Pozas-Kerstjens, David Pérez-García
Title: Tensorization of neural networks for improved privacy and interpretability
Abstract:
We present a tensorization algorithm for constructing tensor train representations of functions, drawing on sketching and cross interpolation ideas. The method only requires black-box access to the target function and a small set of sample points defining the domain of interest. Thus, it is particularly well-suited for machine learning models, where the domain of interest is naturally defined by the training dataset. We show that this approach can be used to enhance the privacy and interpretability of neural network models. Specifically, we apply our decomposition to (i) obfuscate neural networks whose parameters encode patterns tied to the training data distribution, and (ii) estimate topological phases of matter that are easily accessible from the tensor train representation. Additionally, we show that this tensorization can serve as an efficient initialization method for optimizing tensor trains in general settings, and that, for model compression, our algorithm achieves a superior trade-off between memory and time complexity compared to conventional tensorization methods of neural networks.
中文: 该张量化算法通过素描和交叉插值构建张量链表示,仅需黑盒函数访问和小样本集,特别适用于提升机器学习模型的隐私性、可解释性及效率。
English: This tensorization algorithm constructs tensor train representations using sketching and cross interpolation, requiring only black-box function access and a small sample set, making it ideal for enhancing privacy, interpretability, and efficiency in machine learning models.

Authors:Jing Guo, Nan Li, Ming Xu
Title: Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain
Abstract:
Generative AI holds significant potential for ecological and environmental applications such as monitoring, data analysis, education, and policy support. However, its effectiveness is limited by the lack of a unified evaluation framework. To address this, we present the Environmental Large Language model Evaluation (ELLE) question answer (QA) dataset, the first benchmark designed to assess large language models and their applications in ecological and environmental sciences. The ELLE dataset includes 1,130 question answer pairs across 16 environmental topics, categorized by domain, difficulty, and type. This comprehensive dataset standardizes performance assessments in these fields, enabling consistent and objective comparisons of generative AI performance. By providing a dedicated evaluation tool, ELLE dataset promotes the development and application of generative AI technologies for sustainable environmental outcomes. The dataset and code are available at https://elle.ceeai.net/ and https://github.com/CEEAI/elle.
中文: ELLE数据集作为首个生态与环境科学领域大语言模型评估基准,通过标准化性能测评推动生成式AI技术在可持续发展中的应用。
English: The ELLE dataset introduces the first benchmark for evaluating large language models in ecological and environmental sciences, providing a standardized tool to enhance generative AI applications for sustainable outcomes.

Authors:Huaiguang Cai
Title: CAMs as Shapley Value-based Explainers
Abstract:
Class Activation Mapping (CAM) methods are widely used to visualize neural network decisions, yet their underlying mechanisms remain incompletely understood. To enhance the understanding of CAM methods and improve their explainability, we introduce the Content Reserved Game-theoretic (CRG) Explainer. This theoretical framework clarifies the theoretical foundations of GradCAM and HiResCAM by modeling the neural network prediction process as a cooperative game. Within this framework, we develop ShapleyCAM, a new method that leverages gradients and the Hessian matrix to provide more precise and theoretically grounded visual explanations. Due to the computational infeasibility of exact Shapley value calculation, ShapleyCAM employs a second-order Taylor expansion of the cooperative game's utility function to derive a closed-form expression. Additionally, we propose the Residual Softmax Target-Class (ReST) utility function to address the limitations of pre-softmax and post-softmax scores. Extensive experiments across 12 popular networks on the ImageNet validation set demonstrate the effectiveness of ShapleyCAM and its variants. Our findings not only advance CAM explainability but also bridge the gap between heuristic-driven CAM methods and compute-intensive Shapley value-based methods. The code is available at https://github.com/caihuaiguang/pytorch-shapley-cam.
中文摘要:本研究提出CRG理论框架和ShapleyCAM方法,通过将神经网络预测建模为合作博弈,并采用梯度-海森矩阵近似计算沙普利值,显著提升了类激活映射技术的可解释性。
English Summary: The study introduces the CRG theoretical framework and ShapleyCAM method to enhance the explainability of CAM techniques by modeling neural network predictions as cooperative games and using gradient-Hessian approximations for efficient Shapley value computation.

Authors:Daojun Liang, Haixia Zhang, Dongfeng Yuan
Title: Progressive Supervision via Label Decomposition: An Long-Term and Large-Scale Wireless Traffic Forecasting Method
Abstract:
Long-term and Large-scale Wireless Traffic Forecasting (LL-WTF) is pivotal for strategic network management and comprehensive planning on a macro scale. However, LL-WTF poses greater challenges than short-term ones due to the pronounced non-stationarity of extended wireless traffic and the vast number of nodes distributed at the city scale. To cope with this, we propose a Progressive Supervision method based on Label Decomposition (PSLD). Specifically, we first introduce a Random Subgraph Sampling (RSS) algorithm designed to sample a tractable subset from large-scale traffic data, thereby enabling efficient network training. Then, PSLD employs label decomposition to obtain multiple easy-to-learn components, which are learned progressively at shallow layers and combined at deep layers to effectively cope with the non-stationary problem raised by LL-WTF tasks. Finally, we compare the proposed method with various state-of-the-art (SOTA) methods on three large-scale WT datasets. Extensive experimental results demonstrate that the proposed PSLD significantly outperforms existing methods, with an average 2%, 4%, and 11% performance improvement on three WT datasets, respectively. In addition, we built an open source library for WT forecasting (WTFlib) to facilitate related research, which contains numerous SOTA methods and provides a strong benchmark. Experiments can be reproduced through https://github.com/Anoise/WTFlib.
中文: 提出的基于标签分解的渐进监督方法(PSLD)通过随机子图采样和分层学习策略,有效解决了长期大规模无线流量预测中的非平稳性问题,在多个数据集上显著超越了现有最优方法。
English: The proposed Progressive Supervision with Label Decomposition (PSLD) method effectively addresses long-term wireless traffic forecasting challenges by employing random subgraph sampling and progressive learning, achieving significant performance improvements over state-of-the-art methods across multiple datasets.

Authors:Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, Yizhi Song, Liu He, Jing Bi, Mingqian Feng, Xinyang Li, Zeliang Zhang, Chenliang Xu
Title: Generative AI for Cel-Animation: A Survey
Abstract:
The traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation
中文摘要:生成式人工智能通过自动完成中间帧生成、上色等繁重工序,正在革新传统赛璐珞动画制作流程,不仅大幅提升效率、降低技术门槛,还让艺术家能更专注于创意表达,尽管在视觉一致性与伦理规范方面仍存在挑战。
English Summary: Generative AI is revolutionizing traditional Cel-Animation by automating labor-intensive processes like inbetweening and colorization, thereby enhancing efficiency and accessibility while allowing artists to focus on creativity, despite ongoing challenges with visual consistency and ethical concerns.

Authors:Mills Staylor, Amirreza Dolatpour Fathkouhi, Md Khairul Islam, Kaleigh O'Hara, Ryan Ghiles Goudjil, Geoffrey Fox, Judy Fox
Title: Scalable Cosmic AI Inference using Cloud Serverless Computing with FMI
Abstract:
Large-scale astronomical image data processing and prediction are essential for astronomers, providing crucial insights into celestial objects, the universe's history, and its evolution. While modern deep learning models offer high predictive accuracy, they often demand substantial computational resources, making them resource-intensive and limiting accessibility. We introduce the Cloud-based Astronomy Inference (CAI) framework to address these challenges. This scalable solution integrates pre-trained foundation models with serverless cloud infrastructure through a Function-as-a-Service (FaaS) Message Interface (FMI). CAI enables efficient and scalable inference on astronomical images without extensive hardware. Using a foundation model for redshift prediction as a case study, our extensive experiments cover user devices, HPC (High-Performance Computing) servers, and Cloud. CAI's significant scalability improvement on large data sizes provides an accessible and effective tool for the astronomy community. The code is accessible at https://github.com/UVA-MLSys/AI-for-Astronomy.
中文: CAI框架利用无服务器云基础设施和预训练基础模型,为天文学图像处理提供了高度可扩展、成本低廉且高效的解决方案,其速度和吞吐量远超传统高性能计算方法。
English: The CAI framework leverages serverless cloud infrastructure and pre-trained foundation models to deliver a highly scalable, cost-effective, and efficient solution for processing astronomical images, significantly outperforming traditional HPC methods in speed and throughput.

Authors:Mills Staylor, Amirreza Dolatpour Fathkouhi, Md Khairul Islam, Kaleigh O'Hara, Ryan Ghiles Goudjil, Geoffrey Fox, Judy Fox
Title: Scalable Cosmic AI Inference using Cloud Serverless Computing
Abstract:
Large-scale astronomical image data processing and prediction are essential for astronomers, providing crucial insights into celestial objects, the universe's history, and its evolution. While modern deep learning models offer high predictive accuracy, they often demand substantial computational resources, making them resource-intensive and limiting accessibility. We introduce the Cloud-based Astronomy Inference (CAI) framework to address these challenges. This scalable solution integrates pre-trained foundation models with serverless cloud infrastructure through a Function-as-a-Service (FaaS). CAI enables efficient and scalable inference on astronomical images without extensive hardware. Using a foundation model for redshift prediction as a case study, our extensive experiments cover user devices, HPC (High-Performance Computing) servers, and Cloud. Using redshift prediction with the AstroMAE model demonstrated CAI's scalability and efficiency, achieving inference on a 12.6 GB dataset in only 28 seconds compared to 140.8 seconds on HPC GPUs and 1793 seconds on HPC CPUs. CAI also achieved significantly higher throughput, reaching 18.04 billion bits per second (bps), and maintained near-constant inference times as data sizes increased, all at minimal computational cost (under $5 per experiment). We also process large-scale data up to 1 TB to show CAI's effectiveness at scale. CAI thus provides a highly scalable, accessible, and cost-effective inference solution for the astronomy community. The code is accessible at https://github.com/UVA-MLSys/AI-for-Astronomy.
中文: CAI框架利用无服务器云基础设施和预训练基础模型,为天文学图像处理提供了高度可扩展、成本低廉且高效的解决方案,其速度和吞吐量远超传统高性能计算方法。
English: The CAI framework leverages serverless cloud infrastructure and pre-trained foundation models to deliver a highly scalable, cost-effective, and efficient solution for processing astronomical images, significantly outperforming traditional HPC methods in speed and throughput.
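
The speedups implied by the timings quoted in the abstract (28 s for CAI, 140.8 s on HPC GPUs, 1793 s on HPC CPUs, all on the 12.6 GB dataset) can be worked out directly. The variable names below are illustrative; only the three timings come from the abstract.

```python
# Timings reported in the abstract for the 12.6 GB dataset, in seconds.
cai_s, gpu_s, cpu_s = 28.0, 140.8, 1793.0

speedup_vs_gpu = gpu_s / cai_s  # how many times faster than HPC GPUs
speedup_vs_cpu = cpu_s / cai_s  # how many times faster than HPC CPUs

print(f"vs HPC GPUs: {speedup_vs_gpu:.2f}x")  # about 5.03x
print(f"vs HPC CPUs: {speedup_vs_cpu:.2f}x")  # about 64.04x
```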

Authors:Nirit Alkalay, Roy Orfaig, Ben-Zion Bobrovsky
Title: NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data
Abstract:
4D panoptic LiDAR segmentation is essential for scene understanding in autonomous driving and robotics, combining semantic and instance segmentation with temporal consistency. Current methods, like 4D-PLS and 4D-STOP, use a tracking-by-detection methodology, employing deep learning networks to perform semantic and instance segmentation on each frame. To maintain temporal consistency, large-size instances detected in the current frame are compared and associated with instances within a temporal window that includes the current and preceding frames. However, their reliance on short-term instance detection, lack of motion estimation, and exclusion of small-sized instances lead to frequent identity switches and reduced tracking performance. We address these issues with the NextStop tracker, which integrates Kalman filter-based motion estimation, data association, and lifespan management, along with a tracklet state concept to improve prioritization. Evaluated using the LiDAR Segmentation and Tracking Quality (LSTQ) metric on the SemanticKITTI validation set, NextStop demonstrated enhanced tracking performance, particularly for small-sized objects like people and bicyclists, with fewer ID switches, earlier tracking initiation, and improved reliability in complex environments. The source code is available at https://github.com/AIROTAU/NextStop
中文摘要:NextStop跟踪器通过集成运动估计和生命周期管理,改进了4D全景激光雷达分割,在复杂环境中对小尺寸物体实现了更少的身份切换和更优的跟踪性能。
English Summary: The NextStop tracker improves 4D panoptic LiDAR segmentation by incorporating motion estimation and lifespan management, achieving better tracking performance with fewer identity switches for small objects in complex environments.
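
The Kalman filter-based motion estimation mentioned in the abstract can be illustrated with a minimal sketch. This is not the NextStop implementation: it assumes a 1D constant-velocity model in which only the position of an instance center is measured, and the noise levels `q` and `r`, the initial covariance, and the function names are all hypothetical choices made for illustration.

```python
def kalman_step(x, P, z, dt=1.0, q=0.01, r=0.1):
    """One predict/update cycle of a 1D constant-velocity Kalman filter.

    x = [position, velocity], P = 2x2 covariance as nested lists,
    z = measured position. q and r are assumed noise levels.
    """
    # Predict: x' = F x with F = [[1, dt], [0, 1]].
    p_pred = x[0] + dt * x[1]
    v_pred = x[1]
    a, b = P[0]
    c, d = P[1]
    # P' = F P F^T + q * I, written out for the 2x2 case.
    a_p = a + dt * (b + c) + dt * dt * d + q
    b_p = b + dt * d
    c_p = c + dt * d
    d_p = d + q

    # Update with H = [1, 0]: only the position is observed.
    innovation = z - p_pred
    s = a_p + r                      # innovation variance
    k0, k1 = a_p / s, c_p / s        # Kalman gain
    x_new = [p_pred + k0 * innovation, v_pred + k1 * innovation]
    # P = (I - K H) P'
    P_new = [[(1 - k0) * a_p, (1 - k0) * b_p],
             [c_p - k1 * a_p, d_p - k1 * b_p]]
    return x_new, P_new


def track(measurements):
    """Filter a sequence of position measurements; return the final state."""
    x = [measurements[0], 0.0]               # start at rest at the first measurement
    P = [[10.0, 0.0], [0.0, 10.0]]           # large initial uncertainty (assumed)
    for z in measurements[1:]:
        x, P = kalman_step(x, P, z)
    return x
```

Feeding the filter positions of an object that moves one unit per frame, for example `track([float(i) for i in range(20)])`, yields a velocity estimate close to 1. A tracker can use such a predicted state to associate detections across frames even when an instance is briefly missed.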

Authors:Gent Wu
Title: Powerful Design of Small Vision Transformer on CIFAR10
Abstract:
Vision Transformers (ViTs) have demonstrated remarkable success on large-scale datasets, but their performance on smaller datasets often falls short of convolutional neural networks (CNNs). This paper explores the design and optimization of Tiny ViTs for small datasets, using CIFAR-10 as a benchmark. We systematically evaluate the impact of data augmentation, patch token initialization, low-rank compression, and multi-class token strategies on model performance. Our experiments reveal that low-rank compression of queries in Multi-Head Latent Attention (MLA) incurs minimal performance loss, indicating redundancy in ViTs. Additionally, introducing multiple CLS tokens improves global representation capacity, boosting accuracy. These findings provide a comprehensive framework for optimizing Tiny ViTs, offering practical insights for efficient and effective designs. Code is available at https://github.com/erow/PoorViTs.
中文: 本文通过评估低秩压缩和多类别令牌等策略,系统优化了适用于小数据集的微型视觉变换器,发现压缩查询带来的性能损失极小,且引入多个CLS令牌能提升全局表示能力与准确率。
English: This paper systematically optimizes Tiny Vision Transformers for small datasets by evaluating strategies like low-rank compression and multi-class tokens, finding minimal performance loss with compressed queries and enhanced accuracy through multiple CLS tokens.

Authors:Jiayu Guo, Yu Guo, Martha Li, Songtao Tan
Title: FLAME: Financial Large-Language Model Assessment and Metrics Evaluation
Abstract:
LLMs have revolutionized NLP and demonstrated potential across diverse domains. More and more financial LLMs have been introduced for finance-specific tasks, yet comprehensively assessing their value is still challenging. In this paper, we introduce FLAME, a comprehensive financial LLM evaluation system in Chinese, which includes two core evaluation benchmarks: FLAME-Cer and FLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications, including CPA, CFA, and FRM, with a total of approximately 16,000 carefully selected questions. All questions have been manually reviewed to ensure accuracy and representativeness. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks. We evaluate 6 representative LLMs, including GPT-4o, GLM-4, ERNIE-4.0, Qwen2.5, XuanYuan3, and the latest Baichuan4-Finance, revealing that Baichuan4-Finance outperforms other LLMs in most tasks. By establishing a comprehensive and professional evaluation system, FLAME facilitates the advancement of financial LLMs in Chinese contexts. Instructions for participating in the evaluation are available on GitHub: https://github.com/FLAME-ruc/FLAME.
中文:本文提出了FLAME,一个全面的中文金融大模型评估系统,包含金融认证与业务场景两大基准测试,评估显示百川4-金融在多数任务中表现最优。
English: This paper introduces FLAME, a comprehensive Chinese financial LLM evaluation system with two benchmarks—FLAME-Cer for financial certifications and FLAME-Sce for business scenarios—revealing Baichuan4-Finance's superior performance among tested models.

Authors:Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, Arash Vahdat
Title: GenMol: A Drug Discovery Generalist with Discrete Diffusion
Abstract:
Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design. Our code is available at https://github.com/NVIDIA-Digital-Bio/genmol.
中文:GenMol是一种通用的分子生成框架,它采用单一离散扩散模型,通过基于片段的构建策略和创新的片段重掩蔽与分子上下文引导技术,在多种药物发现任务中超越现有模型,为分子设计提供了统一且高效的解决方案。
English: GenMol is a versatile molecular generative framework that uses a single discrete diffusion model with fragment-based building blocks and innovative strategies like fragment remasking and molecular context guidance to outperform previous models across diverse drug discovery tasks, offering a unified approach for molecular design.

Authors:Julius Berner, Lorenz Richter, Marcin Sendera, Jarrid Rector-Brooks, Nikolay Malkin
Title: From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training
Abstract:
We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.
中文摘要:本研究提出了一种无需目标数据即可训练神经扩散模型从玻尔兹曼分布中采样的方法,证明了熵强化学习与连续时间目标之间的等价性,并通过优化时间离散化实现了更高的采样效率。
English Summary: This research introduces a method for training neural diffusion models to sample from Boltzmann distributions without target data, demonstrating equivalence between entropic reinforcement learning and continuous-time objectives while achieving improved efficiency through optimized time discretization.

Authors:Haichao Liu, Ruoyu Yao, Wenru Liu, Zhenmin Huang, Shaojie Shen, Jun Ma
Title: CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems
Abstract:
The increasing demand for flexible and efficient urban transportation solutions has spotlighted the limitations of traditional Demand Responsive Transport (DRT) systems, particularly in accommodating diverse passenger needs and dynamic urban environments. Autonomous Mobility-on-Demand (AMoD) systems have emerged as a promising alternative, leveraging connected and autonomous vehicles (CAVs) to provide responsive and adaptable services. However, existing methods primarily focus on either vehicle scheduling or path planning, which often simplify complex urban layouts and neglect the necessity for simultaneous coordination and mutual avoidance among CAVs. This oversimplification poses significant challenges to the deployment of AMoD systems in real-world scenarios. To address these gaps, we propose CoDriveVLM, a novel framework that integrates high-fidelity simultaneous dispatching and cooperative motion planning for future AMoD systems. Our method harnesses Vision-Language Models (VLMs) to enhance multi-modality information processing, and this enables comprehensive dispatching and collision risk evaluation. The VLM-enhanced CAV dispatching coordinator is introduced to effectively manage complex and unforeseen AMoD conditions, thus supporting efficient scheduling decision-making. Furthermore, we propose a scalable decentralized cooperative motion planning method via consensus alternating direction method of multipliers (ADMM) focusing on collision risk evaluation and decentralized trajectory optimization. Simulation results demonstrate the feasibility and robustness of CoDriveVLM in various traffic conditions, showcasing its potential to significantly improve the fidelity and effectiveness of AMoD systems in future urban transportation networks. The code is available at https://github.com/henryhcliu/CoDriveVLM.git.
中文总结:CoDriveVLM框架通过结合视觉语言模型和分散式协同运动规划,解决了自主按需出行系统的调度与路径规划分离问题,显著提升了复杂城市交通环境中的系统适应性和可靠性。
English Summary: The CoDriveVLM framework addresses limitations in autonomous mobility systems by integrating high-fidelity dispatching with cooperative motion planning using vision-language models and decentralized optimization, demonstrating improved performance in diverse urban scenarios.
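
The consensus ADMM scheme named in the abstract can be illustrated on a toy problem. The sketch below is not the CoDriveVLM planner: it assumes each agent i holds a scalar quadratic cost 0.5 * (x - a_i)^2 instead of a trajectory cost, and the function name, `rho`, and iteration count are hypothetical. It shows only the generic pattern of local solves, an averaging step, and dual updates.

```python
def consensus_admm(targets, rho=1.0, iters=100):
    """Scalar consensus ADMM: agents agree on z minimizing sum_i 0.5*(z - a_i)^2."""
    n = len(targets)
    x = [0.0] * n   # local copies, one per agent
    u = [0.0] * n   # scaled dual variables
    z = 0.0         # shared consensus variable
    for _ in range(iters):
        # Local step (solved independently by each agent):
        # argmin_x 0.5*(x - a_i)^2 + (rho/2)*(x - z + u_i)^2
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average of local copies plus duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x
```

For this toy cost the agents converge to the mean of their targets, so `consensus_admm([1.0, 2.0, 6.0])` drives `z` and every local copy toward 3. In a decentralized planner the local step would be replaced by each vehicle's own trajectory optimization.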

Authors:Leonardo Delfino, Domenico Erriquez, Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
Title: kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search
Abstract:
Approximate Nearest Neighbors (ANN) search is a crucial task in several applications like recommender systems and information retrieval. Current state-of-the-art ANN libraries, although performance-oriented, often lack modularity and ease of use. This makes them poorly suited to easy prototyping and testing of research ideas, an important capability for researchers. We address these limitations by introducing kANNolo, a novel research-oriented ANN library written in Rust and explicitly designed to combine usability with performance effectively. kANNolo is the first ANN library that supports dense and sparse vector representations made available on top of different similarity measures, e.g., Euclidean distance and inner product. Moreover, it also supports vector quantization techniques, e.g., Product Quantization, on top of the indexing strategies implemented. These functionalities are managed through Rust traits, allowing shared behaviors to be handled abstractly. This abstraction ensures flexibility and facilitates an easy integration of new components. In this work, we detail the architecture of kANNolo and demonstrate that its flexibility does not compromise performance. The experimental analysis shows that kANNolo achieves state-of-the-art performance in terms of speed-accuracy trade-off while allowing fast and easy prototyping, thus making kANNolo a valuable tool for advancing ANN research. Source code available on GitHub: https://github.com/TusKANNy/kannolo.
Chinese: kANNolo 是一种基于 Rust 语言的新型面向研究的近似最近邻搜索库,它有效结合了可用性与高性能,支持多种相似性度量的稠密和稀疏向量表示,并在速度-精度权衡方面达到了先进水平。
English: kANNolo is a new research-oriented Approximate Nearest Neighbors library written in Rust that effectively combines usability with performance, supporting both dense and sparse vectors with various similarity measures while achieving state-of-the-art speed-accuracy trade-offs.

Authors:Steffen Dereich, Arnulf Jentzen, Adrian Riekert
Title: Averaged Adam accelerates stochastic optimization in the training of deep neural network approximations for partial differential equation and optimal control problems
Abstract:
Deep learning methods, usually consisting of a class of deep neural networks (DNNs) trained by a stochastic gradient descent (SGD) optimization method, are nowadays omnipresent in data-driven learning problems as well as in scientific computing tasks such as optimal control (OC) and partial differential equation (PDE) problems. In practically relevant learning tasks, the plain-vanilla SGD method is often not the one used to train the considered class of DNNs; instead, more sophisticated adaptive and accelerated variants of SGD, such as the popular Adam optimizer, are employed. Inspired by the classical Polyak-Ruppert averaging approach, in this work we apply averaged variants of the Adam optimizer to train DNNs to approximately solve exemplary scientific computing problems in the form of PDEs and OC problems. We test the averaged variants of Adam in a series of learning problems including physics-informed neural network (PINN), deep backward stochastic differential equation (deep BSDE), and deep Kolmogorov approximations for PDEs (such as heat, Black-Scholes, Burgers, and Allen-Cahn PDEs), as well as DNN approximations for OC problems and for image classification problems (ResNet for CIFAR-10). In each of the numerical examples, the employed averaged variants of Adam outperform the standard Adam and the standard SGD optimizers, particularly for the scientific machine learning problems. The Python source code for the numerical experiments associated with this work can be found on GitHub at https://github.com/deeplearningmethods/averaged-adam.
中文: 本研究证明,在训练深度神经网络解决偏微分方程和最优控制等科学计算问题时,采用平均化变体的Adam优化器比标准Adam和随机梯度下降方法表现更优。
English: This study demonstrates that averaged variants of the Adam optimizer outperform standard Adam and SGD methods in training deep neural networks for scientific computing tasks like PDEs and optimal control problems.
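
The idea of combining Adam with Polyak-Ruppert-style averaging can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses a plain arithmetic running mean of the iterates after a burn-in period (the paper's averaged variants may differ), a scalar toy objective 0.5 * (x - 3)^2 with Gaussian gradient noise, and hypothetical names and hyperparameters throughout.

```python
import math
import random


def averaged_adam(grad, x0, steps=2000, burn_in=1000,
                  lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam and return (last iterate, running mean of late iterates)."""
    x, m, v = x0, 0.0, 0.0
    x_avg, n_avg = 0.0, 0
    for t in range(1, steps + 1):
        g = grad(x)
        # Standard Adam moment estimates with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        # Averaging step (assumed form): incremental mean over iterates after burn-in.
        if t > burn_in:
            n_avg += 1
            x_avg += (x - x_avg) / n_avg
    return x, x_avg


# Toy stochastic problem: minimize 0.5 * (x - 3)^2 from noisy gradients.
rng = random.Random(0)
noisy_grad = lambda x: (x - 3.0) + rng.gauss(0.0, 1.0)
x_last, x_avg = averaged_adam(noisy_grad, x0=0.0)
```

With a constant learning rate the last iterate keeps fluctuating around the minimizer because of gradient noise, whereas the averaged iterate smooths out much of that fluctuation. This is the usual motivation for iterate averaging.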

Authors:Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji
Title: Generate, Transduct, Adapt: Iterative Transduction with VLMs
Abstract:
Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP yields an average performance improvement of 8.6% and 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by the transductive learning.
中文摘要:GTA-CLIP通过语言模型监督实现语言与视觉空间的联合转导,采用迭代式属性增强和编码器微调方法,在零样本学习中显著提升了分类准确率。
English Summary: GTA-CLIP introduces iterative language-vision transduction with language model supervision, achieving significant accuracy improvements in zero-shot learning by augmenting attributes and fine-tuning encoders.

Authors:Hongruixuan Chen, Jian Song, Olivier Dietrich, Clifford Broni-Bediako, Weihao Xuan, Junjue Wang, Xinlei Shao, Yimin Wei, Junshi Xia, Cuiling Lan, Konrad Schindler, Naoto Yokoya
Title: BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response
Abstract:
Disaster events occur around the world and cause significant damage to human life and property. Earth observation (EO) data enables rapid and comprehensive building damage assessment (BDA), an essential capability in the aftermath of a disaster to reduce human casualties and to inform disaster relief efforts. Recent research focuses on the development of AI models to achieve accurate mapping of unseen disaster events, mostly using optical EO data. However, solutions based on optical data are limited to clear skies and daylight hours, preventing a prompt response to disasters. Integrating multimodal (MM) EO data, particularly the combination of optical and SAR imagery, makes it possible to provide all-weather, day-and-night disaster responses. Despite this potential, the development of robust multimodal AI models has been constrained by the lack of suitable benchmark datasets. In this paper, we present a BDA dataset using veRy-hIGH-resoluTion optical and SAR imagery (BRIGHT) to support AI-based all-weather disaster response. To the best of our knowledge, BRIGHT is the first open-access, globally distributed, event-diverse MM dataset specifically curated to support AI-based disaster response. It covers five types of natural disasters and two types of man-made disasters across 14 regions worldwide, with a particular focus on developing countries where external assistance is most needed. The optical and SAR imagery in BRIGHT, with a spatial resolution between 0.3-1 meters, provides detailed representations of individual buildings, making it ideal for precise BDA. In our experiments, we have tested seven advanced AI models trained with our BRIGHT to validate the transferability and robustness. The dataset and code are available at https://github.com/ChenHongruixuan/BRIGHT. BRIGHT also serves as the official dataset for the 2025 IEEE GRSS Data Fusion Contest.
中文:BRIGHT数据集通过提供高分辨率光学与SAR影像,支持全天候建筑物损毁评估,克服了单一光学数据的局限,为基于人工智能的全球多样化灾害响应提供了关键数据支撑。
English: The BRIGHT dataset provides high-resolution optical and SAR imagery to enable all-weather building damage assessment, addressing the limitations of optical-only methods and supporting AI-driven disaster response across diverse global events.

Authors:David Bojanić, Stefanie Wuhrer, Tomislav Petković, Tomislav Pribanić
Title: Pose-independent 3D Anthropometry from Sparse Data
Abstract:
3D digital anthropometry is the study of estimating human body measurements from 3D scans. Precise body measurements are important health indicators in the medical industry, and guiding factors in the fashion, ergonomic and entertainment industries. The measuring protocol consists of scanning the whole subject in the static A-pose, which is maintained without breathing or movement during the scanning process. However, the A-pose is not easy to maintain during the whole scanning process, which can last even up to a couple of minutes. This constraint affects the final quality of the scan, which in turn affects the accuracy of the estimated body measurements obtained from methods that rely on dense geometric data. Additionally, this constraint makes it impossible to develop a digital anthropometry method for subjects unable to assume the A-pose, such as those with injuries or disabilities. We propose a method that can obtain body measurements from sparse landmarks acquired in any pose. We make use of the sparse landmarks of the posed subject to create pose-independent features, and train a network to predict the body measurements as taken from the standard A-pose. We show that our method achieves comparable results to competing methods that use dense geometry in the standard A-pose, but has the capability of estimating the body measurements from any pose using sparse landmarks only. Finally, we address the lack of open-source 3D anthropometry methods by making our method available to the research community at https://github.com/DavidBoja/pose-independent-anthropometry.
中文: 本研究提出了一种姿态无关的人体测量方法,利用任意姿态下的稀疏关键点估算身体尺寸,其精度与标准A姿态方法相当,可适用于行动不便者,并将该方法开源供研究社区使用。
English: This study introduces a pose-independent method for estimating human body measurements using sparse landmarks from any pose, achieving accuracy comparable to standard A-pose techniques while enabling applications for individuals with mobility limitations and making the approach publicly available.

Authors:Kevin Mancini, Islem Rekik
Title: DeltaGNN: Graph Neural Network with Information Flow Control
Abstract:
Graph Neural Networks (GNNs) are popular deep learning models designed to process graph-structured data through recursive neighborhood aggregations in the message passing process. When applied to semi-supervised node classification, the message-passing enables GNNs to understand short-range spatial interactions, but also causes them to suffer from over-smoothing and over-squashing. These challenges hinder model expressiveness and prevent the use of deeper models to capture long-range node interactions (LRIs) within the graph. Popular solutions for LRIs detection are either too expensive to process large graphs due to high time complexity or fail to generalize across diverse graph structures. To address these limitations, we propose a mechanism called information flow control, which leverages a novel connectivity measure, called information flow score, to address over-smoothing and over-squashing with linear computational overhead, supported by theoretical evidence. Finally, to prove the efficacy of our methodology we design DeltaGNN, the first scalable and generalizable approach for detecting long-range and short-range interactions. We benchmark our model across 10 real-world datasets, including graphs with varying sizes, topologies, densities, and homophilic ratios, showing superior performance with limited computational complexity. The implementation of the proposed methods is publicly available at https://github.com/basiralab/DeltaGNN.
Chinese: 图神经网络在捕捉长程交互时面临过度平滑和挤压的挑战,而提出的DeltaGNN通过信息流控制机制,以线性计算成本有效解决了这些问题,并在多样化图结构中展现出优越性能。
English: Graph Neural Networks face challenges like over-smoothing and over-squashing that limit their ability to capture long-range interactions, but the proposed DeltaGNN with information flow control effectively addresses these issues while maintaining computational efficiency across diverse datasets.

Authors:Sauda Adiv Hanum, Ashim Dey, Muhammad Ashad Kabir
Title: An Attention-Guided Deep Learning Approach for Classifying 39 Skin Lesion Types
Abstract:
The skin, as the largest organ of the human body, is vulnerable to a diverse array of conditions collectively known as skin lesions, which encompass various dermatoses. Diagnosing these lesions presents significant challenges for medical practitioners due to the subtle visual differences that are often imperceptible to the naked eye. While not all skin lesions are life-threatening, certain types can act as early indicators of severe diseases, including skin cancers, underscoring the critical need for timely and accurate diagnostic methods. Deep learning algorithms have demonstrated remarkable potential in facilitating the early detection and prognosis of skin lesions. This study advances the field by curating a comprehensive and diverse dataset comprising 39 categories of skin lesions, synthesized from five publicly available datasets. Using this dataset, the performance of five state-of-the-art deep learning models (MobileNetV2, Xception, InceptionV3, EfficientNetB1, and Vision Transformer) is rigorously evaluated. To enhance the accuracy and robustness of these models, attention mechanisms such as the Efficient Channel Attention (ECA) and the Convolutional Block Attention Module (CBAM) are incorporated into their architectures. Comprehensive evaluation across multiple performance metrics reveals that the Vision Transformer model integrated with CBAM outperforms others, achieving an accuracy of 93.46%, precision of 94%, recall of 93%, F1-score of 93%, and specificity of 93.67%. These results underscore the significant potential of the proposed system in supporting medical professionals with accurate and efficient prognostic tools for diagnosing a broad spectrum of skin lesions. The dataset and code used in this study can be found at https://github.com/akabircs/Skin-Lesions-Classification.
Chinese: 本研究通过将注意力机制与五种先进模型相结合,开发了一个用于诊断39种皮肤病变的深度学习系统,结果表明集成CBAM的Vision Transformer模型表现最佳,准确率达93.46%。
English: This study develops a deep learning system for diagnosing 39 types of skin lesions by integrating attention mechanisms with five advanced models, demonstrating that the Vision Transformer with CBAM achieves superior performance with 93.46% accuracy.
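
The metrics reported in the abstract (accuracy, precision, recall, F1-score, specificity) all derive from confusion-matrix counts. The sketch below shows the standard definitions for a single class treated one-vs-rest. How the study aggregates them across its 39 classes is not stated in the abstract, so the function name and the binary framing are assumptions made for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from one-vs-rest confusion counts for a single class."""
    precision = tp / (tp + fp)      # of predicted positives, how many are correct
    recall = tp / (tp + fn)         # of actual positives, how many are found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),   # of actual negatives, how many are rejected
    }
```

In a multi-class setting, one common convention is to compute these per class and then average them (macro averaging). Specificity matters alongside recall because a rare lesion class can have high accuracy even when most of its true cases are missed.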

Authors:Kuan Liu, Zongyuan Ying, Jie Jin, Dongyan Li, Ping Huang, Wenjian Wu, Zhe Chen, Jin Qi, Yong Lu, Lianfu Deng, Bo Chen
Title: Swin-X2S: Reconstructing 3D Shape from 2D Biplanar X-ray with Swin Transformers
Abstract:
The conversion from 2D X-ray to 3D shape holds significant potential for improving diagnostic efficiency and safety. However, existing reconstruction methods often rely on hand-crafted features, manual intervention, and prior knowledge, resulting in unstable shape errors and additional processing costs. In this paper, we introduce Swin-X2S, an end-to-end deep learning method for directly reconstructing 3D segmentation and labeling from 2D biplanar orthogonal X-ray images. Swin-X2S employs an encoder-decoder architecture: the encoder leverages 2D Swin Transformer for X-ray information extraction, while the decoder employs 3D convolution with cross-attention to integrate structural features from orthogonal views. A dimension-expanding module is introduced to bridge the encoder and decoder, ensuring a smooth conversion from 2D pixels to 3D voxels. We evaluate the proposed method through extensive qualitative and quantitative experiments across nine publicly available datasets covering four anatomies (femur, hip, spine, and rib), with a total of 54 categories. Significant improvements over previous methods have been observed not only in the segmentation and labeling metrics but also in the clinically relevant parameters that are of primary concern in practical applications, which demonstrates the promise of Swin-X2S to provide an effective option for anatomical shape reconstruction in clinical scenarios. Code implementation is available at: https://github.com/liukuan5625/Swin-X2S.
中文: Swin-X2S是一种端到端的深度学习方法,可直接从二维X射线图像重建三维解剖结构,在多个数据集的细分和临床指标上均表现出卓越性能。
English: Swin-X2S is an end-to-end deep learning method that directly reconstructs 3D anatomical shapes from 2D X-ray images, achieving superior performance in segmentation and clinical metrics across multiple datasets.

Authors:Naval Kishore Mehta, Arvind, Himanshu Kumar, Abeer Banerjee, Sumeet Saurav, Sanjay Singh
Title: A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction
Abstract:
Detecting and interpreting operator actions, engagement, and object interactions in dynamic industrial workflows remains a significant challenge in human-robot collaboration research, especially within complex, real-world environments. Traditional unimodal methods often fall short of capturing the intricacies of these unstructured industrial settings. To address this gap, we present a novel Multimodal Industrial Activity Monitoring (MIAM) dataset that captures realistic assembly and disassembly tasks, facilitating the evaluation of key meta-tasks such as action localization, object interaction, and engagement prediction. The dataset comprises multi-view RGB, depth, and Inertial Measurement Unit (IMU) data collected from 22 sessions, amounting to 290 minutes of untrimmed video, annotated in detail for task performance and operator behavior. Its distinctiveness lies in the integration of multiple data modalities and its emphasis on real-world, untrimmed industrial workflows, both of which are key for advancing research in human-robot collaboration and operator monitoring. Additionally, we propose a multimodal network that fuses RGB frames, IMU data, and skeleton sequences to predict engagement levels during industrial tasks. Our approach improves the accuracy of recognizing engagement states, providing a robust solution for monitoring operator performance in dynamic industrial environments. The dataset and code can be accessed from https://github.com/navalkishoremehta95/MIAM/.
Chinese: 该研究提出了一种新颖的多模态工业活动监测(MIAM)数据集及多模态网络,旨在提升复杂工业流程中操作员行为、参与度和物体交互的检测能力,从而增强人机协作的准确性。
English: The study introduces a novel Multimodal Industrial Activity Monitoring (MIAM) dataset and a multimodal network to enhance the detection of operator actions, engagement, and object interactions in complex industrial workflows, improving accuracy in human-robot collaboration.

Authors:Ziheng Wu, Zhenghao Chen, Ruipu Luo, Can Zhang, Yuan Gao, Zhentao He, Xian Wang, Haoran Lin, Minghui Qiu
Title: Valley2: Exploring Multimodal Models with Scalable Vision-Language Design
Abstract:
Recently, vision-language models have made remarkable progress, demonstrating outstanding capabilities in various tasks such as image captioning and video understanding. We introduce Valley2, a novel multimodal large language model designed to enhance performance across all domains and extend the boundaries of practical applications in e-commerce and short video scenarios. Notably, Valley2 achieves state-of-the-art (SOTA) performance on e-commerce benchmarks, surpassing open-source models of similar size by a large margin (79.66 vs. 72.76). Additionally, Valley2 ranks second on the OpenCompass leaderboard among models with fewer than 10B parameters, with an impressive average score of 67.4. The code and model weights are open-sourced at https://github.com/bytedance/Valley.
中文: Valley2是一种新型多模态大语言模型,在电商基准测试中取得最优性能,并在少于100亿参数的模型中排名第二,其代码和模型权重已开源。
English: Valley2 is a new multimodal large language model that achieves state-of-the-art performance in e-commerce benchmarks and ranks second among models with fewer than 10B parameters, with its code and model weights being open-sourced.

Authors:Zhifan Song, Yuan Zhang, Abd Al Rahman M. Abu Ebayyeh
Title: EDNet: Edge-Optimized Small Target Detection in UAV Imagery -- Faster Context Attention, Better Feature Fusion, and Hardware Acceleration
Abstract:
Detecting small targets in drone imagery is challenging due to low resolution, complex backgrounds, and dynamic scenes. We propose EDNet, a novel edge-target detection framework built on an enhanced YOLOv10 architecture, optimized for real-time applications without post-processing. EDNet incorporates an XSmall detection head and a Cross Concat strategy to improve feature fusion and multi-scale context awareness for detecting tiny targets in diverse environments. Our unique C2f-FCA block employs Faster Context Attention to enhance feature extraction while reducing computational complexity. The WIoU loss function is employed for improved bounding box regression. With seven model sizes ranging from Tiny to XL, EDNet accommodates various deployment environments, enabling local real-time inference and ensuring data privacy. Notably, EDNet achieves up to a 5.6% gain in mAP@50 with significantly fewer parameters. On an iPhone 12, EDNet variants operate at speeds ranging from 16 to 55 FPS, providing a scalable and efficient solution for edge-based object detection in challenging drone imagery. The source code and pre-trained models are available at: https://github.com/zsniko/EDNet.
中文摘要:EDNet是基于YOLOv10优化的实时目标检测框架,通过改进特征融合和多尺度上下文感知技术,有效提升无人机图像中小目标检测性能,在多种模型尺寸下以更少参数实现更高精度。
English Summary: EDNet is an optimized real-time object detection framework based on YOLOv10 that enhances small target detection in drone imagery through improved feature fusion and multi-scale context awareness, achieving higher accuracy with fewer parameters across multiple model sizes.

Authors:Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus
Abstract:
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these limitations, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, because the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
中文: VideoRAG是一种创新框架,通过动态检索相关视频并利用大型视频语言模型处理多模态内容来增强回答生成,它通过引入帧选择机制和文本提取策略弥补了现有方法的不足,从而提高了事实准确性。
English: VideoRAG is a novel framework that dynamically retrieves relevant videos and leverages their multimodal content through Large Video Language Models to enhance response generation, addressing gaps in existing methods by incorporating frame selection and text extraction for improved accuracy.

Authors:Antonin Poché, Alon Jacovi, Agustin Martin Picard, Victor Boutin, Fanny Jourdan
Title: ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability
Abstract:
Concept-based explanations work by mapping complex model computations to human-understandable concepts. Evaluating such explanations is very difficult, as it includes not only the quality of the induced space of possible concepts but also how effectively the chosen concepts are communicated to users. Existing evaluation metrics often focus solely on the former, neglecting the latter. We introduce an evaluation framework for measuring concept explanations via automated simulatability: a simulator's ability to predict the explained model's outputs based on the provided explanations. This approach accounts for both the concept space and its interpretation in an end-to-end evaluation. Human studies for simulatability are notoriously difficult to enact, particularly at the scale of a wide, comprehensive empirical evaluation (which is the subject of this work). We propose using large language models (LLMs) as simulators to approximate the evaluation and report various analyses to make such approximations reliable. Our method allows for scalable and consistent evaluation across various models and datasets. We report a comprehensive empirical evaluation using this framework and show that LLMs provide consistent rankings of explanation methods. Code available at https://github.com/AnonymousConSim/ConSim.
中文摘要:本文提出了一种利用大型语言模型的自动可模拟性评估框架,用于衡量基于概念的解释方法,实现了跨模型和数据集的可扩展评估,同时兼顾概念质量与传达效果。
English Summary: This paper introduces an automated simulatability framework using large language models to evaluate concept-based explanations, enabling scalable assessment of both concept quality and communication effectiveness across diverse models and datasets.

Authors:Xinting Hu, Haoran Wang, Jan Eric Lenssen, Bernt Schiele
Title: PersonaHOI: Effortlessly Improving Personalized Face with Human-Object Interaction Generation
Abstract:
We introduce PersonaHOI, a training- and tuning-free framework that fuses a general StableDiffusion model with a personalized face diffusion (PFD) model to generate identity-consistent human-object interaction (HOI) images. While existing PFD models have advanced significantly, they often overemphasize facial features at the expense of full-body coherence. To address this, PersonaHOI introduces an additional StableDiffusion (SD) branch guided by HOI-oriented text inputs. By incorporating cross-attention constraints in the PFD branch and spatial merging at both latent and residual levels, PersonaHOI preserves personalized facial details while ensuring interactive non-facial regions. Experiments, validated by a novel interaction alignment metric, demonstrate the superior realism and scalability of PersonaHOI, establishing a new standard for practical personalized face with HOI generation. Our code will be available at https://github.com/JoyHuYY1412/PersonaHOI
中文: PersonaHOI是一种无需训练的框架,通过将通用StableDiffusion与个性化人脸扩散模型相结合,在保持全身协调性的同时生成身份一致的人物交互图像,其交叉注意力约束和空间融合技术确保了面部细节与交互区域的真实感。
English: PersonaHOI is a training-free framework that integrates StableDiffusion with personalized face diffusion models to generate identity-consistent human-object interaction images while maintaining full-body coherence through cross-attention constraints and spatial merging.

Authors:Sunwoo Kim, Minkyu Kim, Dongmin Park
Title: Test-time Alignment of Diffusion Models without Reward Over-optimization
Abstract:
Diffusion models excel in generative tasks, but aligning them with specific objectives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to optimize target rewards effectively. Addressing these limitations, we propose a training-free, test-time method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box optimization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities. Code is available at https://github.com/krafton-ai/DAS.
Chinese: 本文提出了一种无需训练、基于序贯蒙特卡洛的测试时方法,能在保持扩散模型多样性和泛化能力的同时有效对齐目标奖励,其性能优于现有微调方法。
English: This paper introduces a training-free, test-time Sequential Monte Carlo method that effectively aligns diffusion models with target rewards while preserving their diversity and generalization, outperforming existing fine-tuning approaches.

Authors:Taywon Min, Haeone Lee, Yongchan Kwon, Kimin Lee
Title: Understanding Impact of Human Feedback via Influence Functions
Abstract:
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in refining their strategies to better align with expert feedback. By quantifying the impact of human feedback, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF
中文: 本研究引入影响函数来评估人类反馈在RLHF中对奖励模型的影响,能够高效检测偏差并指导标注者提升反馈的准确性和一致性。
English: This study introduces influence functions to assess the impact of human feedback on reward models in RLHF, enabling efficient detection of biases and guidance for labelers to improve feedback accuracy and consistency.

Authors:Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok
Title: Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models
Abstract:
Recent advancements in large language models (LLMs) have shown remarkable potential in various complex tasks requiring multi-step reasoning methods like tree search to explore diverse reasoning paths. However, existing methods often suffer from computational inefficiency and redundancy. First, they overlook the diversity of task difficulties, leading to unnecessarily extensive searches even for easy tasks. Second, they neglect the semantics of reasoning paths, resulting in redundant exploration of semantically identical paths. To address these limitations, we propose Semantic Exploration with Adaptive Gating (SEAG), a computationally efficient method. SEAG employs an adaptive gating mechanism that dynamically decides whether to conduct a tree search, based on the confidence level of answers from a preceding simple reasoning method. Furthermore, its tree-based exploration consolidates semantically identical reasoning steps, reducing redundant explorations while maintaining or even improving accuracy. Our extensive experiments demonstrate that SEAG significantly improves accuracy by 4.3% on average while requiring only 31% of computational costs compared to existing tree search-based methods on complex reasoning benchmarks including GSM8K and ARC with diverse language models such as Llama2, Llama3, and Mistral. Our code is available at https://github.com/ml-postech/SEAG-semantic-exploration-with-adaptive-gating .
Chinese Summary: 提出的SEAG方法通过自适应门控机制根据任务难度动态调整搜索强度,并整合语义相同的推理路径,在仅需31%计算成本的情况下,相比现有树搜索方法平均准确率提升4.3%。
English Summary: The proposed Semantic Exploration with Adaptive Gating (SEAG) method enhances computational efficiency by dynamically adjusting search efforts based on task difficulty and consolidating semantically similar reasoning paths, achieving 4.3% higher accuracy with only 31% of computational costs compared to existing tree search methods.

Authors:Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li
Title: ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification
Abstract:
In speaker verification, we use computational methods to verify if an utterance matches the identity of an enrolled speaker. This task is similar to the manual task of forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. A novel approach, the Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait, which describes the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which trait is most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.
Chinese: 本文提出ExPO网络,一种可解释的说话人验证系统,通过分析语音特征提供类似法庭语音比对方法的细粒度可解释结果。
English: This paper introduces the ExPO network, an explainable speaker verification system that analyzes phonetic traits to provide fine-grained, interpretable results similar to forensic voice comparison methods.

Authors:Sehyung Kim, Chanhyeong Yang, Jihwan Park, Taehoon Song, Hyunwoo J. Kim
Title: Super-class guided Transformer for Zero-Shot Attribute Classification
Abstract:
Attribute classification is crucial for identifying specific characteristics within image regions. Vision-Language Models (VLMs) have been effective in zero-shot tasks by leveraging their general knowledge from large-scale datasets. Recent studies demonstrate that transformer-based models with class-wise queries can effectively address zero-shot multi-label classification. However, poor utilization of the relationship between seen and unseen attributes makes the model lack generalizability. Additionally, attribute classification generally involves many attributes, making maintaining the model's scalability difficult. To address these issues, we propose Super-class guided transFormer (SugaFormer), a novel framework that leverages super-classes to enhance scalability and generalizability for zero-shot attribute classification. SugaFormer employs Super-class Query Initialization (SQI) to reduce the number of queries, utilizing common semantic information from super-classes, and incorporates Multi-context Decoding (MD) to handle diverse visual cues. To strengthen generalizability, we introduce two knowledge transfer strategies that utilize VLMs. During training, Super-class guided Consistency Regularization (SCR) aligns the model's features with VLMs using super-class guided prompts, and during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Extensive experiments demonstrate that SugaFormer achieves state-of-the-art performance across three widely-used attribute classification benchmarks under zero-shot and cross-dataset transfer settings. Our code is available at https://github.com/mlvlab/SugaFormer.
中文摘要:提出的SugaFormer框架通过利用超类减少查询数量提升可扩展性,并借助视觉语言模型的知识迁移增强泛化能力,从而改进零样本属性分类性能。
English Summary: The proposed SugaFormer framework enhances zero-shot attribute classification by leveraging super-classes to improve scalability through query reduction and generalizability via knowledge transfer from Vision-Language Models.

Authors:Dominick Reilly, Manish Kumar Govind, Le Xue, Srijan Das
Title: From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities
Abstract:
Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLMs' understanding of exocentric ADL videos. Consequently, we propose ego2exo knowledge distillation to learn ego-augmented exo representations. While effective, this approach requires paired ego-exo videos, which are impractical to collect at scale. To address this, we propose Skeleton-guided Synthetic Ego Generation (SK-EGO), which leverages human skeleton motion to generate synthetic ego views from exocentric videos. To enhance the ego representation of LVLMs trained on synthetic data, we develop a domain-agnostic bootstrapped ego2exo strategy that effectively transfers knowledge from real ego-exo pairs to synthetic ego-exo pairs, while mitigating domain misalignment. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed Ego-in-Exo PerceptionMCQ benchmark designed specifically to assess egocentric understanding from exocentric videos. Code, models, and data will be open-sourced at https://github.com/dominickrei/EgoExo4ADL.
中文: 本研究通过引入自我到外部知识蒸馏和基于骨骼的合成自我视角生成方法,增强大型视觉语言模型对日常生活活动的理解能力,有效解决了配对视频数据难以大规模获取的问题,并通过全面评估验证了从外部视角视频中提取自我视角线索的有效性。
English: This study enhances Large Vision Language Models' understanding of Activities of Daily Living by introducing ego2exo knowledge distillation and a skeleton-guided synthetic ego generation method to overcome the limitation of requiring paired ego-exo videos, with comprehensive evaluations demonstrating improved egocentric cue extraction from exocentric videos.

Authors:Shuolong Chen, Xingxing Li, Liu Yuan, Ziao Liu
Title: eKalibr: Dynamic Intrinsic Calibration for Event Cameras From First Principles of Events
Abstract:
The bio-inspired event camera has garnered extensive research attention in recent years, owing to its significant potential derived from its high dynamic range and low latency characteristics. Similar to the standard camera, the event camera requires precise intrinsic calibration to facilitate further high-level visual applications, such as pose estimation and mapping. While several calibration methods for event cameras have been proposed, most of them are either (i) engineering-driven, heavily relying on conventional image-based calibration pipelines, or (ii) inconvenient, requiring complex instrumentation. To this end, we propose an accurate and convenient intrinsic calibration method for event cameras, named eKalibr, which builds upon a carefully designed event-based circle grid pattern recognition algorithm. To extract target patterns from events, we perform event-based normal flow estimation to identify potential events generated by circle edges, and cluster them spatially. Subsequently, event clusters associated with the same grid circles are matched and grouped using normal flows, for subsequent time-varying ellipse estimation. Fitted ellipse centers are time-synchronized, for final grid pattern recognition. We conducted extensive experiments to evaluate the performance of eKalibr in terms of pattern extraction and intrinsic calibration. The implementation of eKalibr is open-sourced at (https://github.com/Unsigned-Long/eKalibr) to benefit the research community.
Chinese: 本文提出了一种名为eKalibr的事件相机内参标定方法,通过精心设计的事件驱动圆形网格识别算法,实现了精确便捷的标定,解决了现有方法依赖传统流程或需要复杂设备的问题。
English: The paper introduces eKalibr, an accurate and convenient intrinsic calibration method for event cameras that uses a novel event-based circle grid pattern recognition algorithm to overcome the limitations of existing engineering-heavy or complex approaches.

Authors:Yinghao Zhu, Xiaochen Zheng, Ahmed Allam, Michael Krauthammer
Title: TAMER: A Test-Time Adaptive MoE-Driven Framework for EHR Representation Learning
Abstract:
We propose TAMER, a Test-time Adaptive MoE-driven framework for Electronic Health Record (EHR) Representation learning. TAMER introduces a framework where a Mixture-of-Experts (MoE) architecture is co-designed with Test-Time Adaptation (TTA) to jointly mitigate the intertwined challenges of patient heterogeneity and distribution shifts in EHR modeling. The MoE focuses on latent patient subgroups through domain-aware expert specialization, while TTA enables real-time adaptation to evolving health status distributions when new patient samples are introduced. Extensive experiments across four real-world EHR datasets demonstrate that TAMER consistently improves predictive performance for both mortality and readmission risk tasks when combined with diverse EHR modeling backbones. TAMER offers a promising approach for dynamic and personalized EHR-based predictions in practical clinical settings.
中文: TAMER提出一种测试时自适应框架,通过专家混合架构解决电子健康记录建模中的患者异质性和分布偏移问题,在多个真实数据集上显著提升了死亡率和再入院风险的预测性能。
English: TAMER is a test-time adaptive framework using a Mixture-of-Experts architecture to address patient heterogeneity and distribution shifts in EHR modeling, demonstrating improved performance for mortality and readmission predictions across multiple datasets.

Authors:Ayush Khot, Xiwei Wang, Avik Roy, Volodymyr Kindratenko, Mark S. Neubauer
Title: Evidential Deep Learning for Uncertainty Quantification and Out-of-Distribution Detection in Jet Identification using Deep Neural Networks
Abstract:
Current methods commonly used for uncertainty quantification (UQ) in deep learning (DL) models utilize Bayesian methods which are computationally expensive and time-consuming. In this paper, we provide a detailed study of UQ based on evidential deep learning (EDL) for deep neural network models designed to identify jets in high energy proton-proton collisions at the Large Hadron Collider and explore its utility in anomaly detection. EDL is a DL approach that treats learning as an evidence acquisition process designed to provide confidence (or epistemic uncertainty) about test data. Using publicly available datasets for jet classification benchmarking, we explore hyperparameter optimizations for EDL applied to the challenge of UQ for jet identification. We also investigate how the uncertainty is distributed for each jet class, how this method can be implemented for the detection of anomalies, how the uncertainty compares with Bayesian ensemble methods, and how the uncertainty maps onto latent spaces for the models. Our studies uncover some pitfalls of EDL applied to anomaly detection and a more effective way to quantify uncertainty from EDL as compared with the foundational EDL setup. These studies illustrate a methodological approach to interpreting EDL in jet classification models, providing new insights on how EDL quantifies uncertainty and detects out-of-distribution data which may lead to improved EDL methods for DL models applied to classification tasks.
中文: 本研究探讨了证据深度学习在喷注分类模型中的不确定性量化应用,揭示了其相对于贝叶斯方法的优势,并提出了改进异常检测和不确定性解释的新途径。
English: This study explores evidential deep learning (EDL) for uncertainty quantification in jet classification models, revealing its advantages over Bayesian methods and identifying improved approaches for anomaly detection and uncertainty interpretation.

Authors:Zhao Yang, Bing Su, Jiahao Chen, Ji-Rong Wen
Title: Interpretable Enzyme Function Prediction via Residue-Level Detection
Abstract:
Predicting multiple functions labeled with Enzyme Commission (EC) numbers from the enzyme sequence is of great significance but remains a challenge due to its sparse multi-label classification nature, i.e., each enzyme is typically associated with only a few labels out of more than 6000 possible EC numbers. However, existing machine learning algorithms generally learn a fixed global representation for each enzyme to classify all functions, so they lack interpretability, and the fine-grained information of some function-specific local residue fragments may be overwhelmed. Here we present an attention-based framework, namely ProtDETR (Protein Detection Transformer), by casting enzyme function prediction as a detection problem. It uses a set of learnable functional queries to adaptively extract different local representations from the sequence of residue-level features for predicting different EC numbers. ProtDETR not only significantly outperforms existing deep learning-based enzyme function prediction methods, but also provides a new interpretable perspective on automatically detecting different local regions for identifying different functions through cross-attentions between queries and residue-level features. Code is available at https://github.com/yangzhao1230/ProtDETR.
中文:ProtDETR是一个基于注意力的框架,将酶功能预测视为检测问题,通过功能查询提取局部表征来预测EC编号,在性能和可解释性上均优于现有方法。
English: ProtDETR is an attention-based framework that treats enzyme function prediction as a detection problem, using functional queries to extract local representations for predicting EC numbers, achieving superior performance and interpretability over existing methods.

Authors:Joe Eappen, Zikang Xiong, Dipam Patel, Aniket Bera, Suresh Jagannathan
Title: Scaling Safe Multi-Agent Control for Signal Temporal Logic Specifications
Abstract:
Existing methods for safe multi-agent control using logic specifications like Signal Temporal Logic (STL) often face scalability issues. This is because they rely either on single-agent perspectives or on Mixed Integer Linear Programming (MILP)-based planners, which are complex to optimize. These methods have proven to be computationally expensive and inefficient when dealing with a large number of agents. To address these limitations, we present a new scalable approach to multi-agent control in this setting. Our method treats the relationships between agents using a graph structure rather than in terms of a single-agent perspective. Moreover, it combines a multi-agent collision avoidance controller with a Graph Neural Network (GNN) based planner, models the system in a decentralized fashion, and trains on STL-based objectives to generate safe and efficient plans for multiple agents, thereby optimizing the satisfaction of complex temporal specifications while also facilitating multi-agent collision avoidance. Our experiments show that our approach significantly outperforms existing methods that use a state-of-the-art MILP-based planner in terms of scalability and performance. The project website is https://jeappen.com/mastl-gcbf-website/ and the code is at https://github.com/jeappen/mastl-gcbf .
中文: 本文提出了一种可扩展的多智能体控制方法,采用图结构结合避障控制器与基于图神经网络的规划器,有效处理复杂时序规范,并在可扩展性和性能上显著优于现有基于混合整数线性规划的方法。
English: This paper introduces a scalable multi-agent control method that uses a graph structure and combines a collision avoidance controller with a GNN-based planner to efficiently handle complex specifications and improve performance over existing MILP-based approaches.

Authors:Anant Mehta, Bryant McArthur, Nagarjuna Kolloju, Zhengzhong Tu
Title: HFMF: Hierarchical Fusion Meets Multi-Stream Models for Deepfake Detection
Abstract:
The rapid progress in deep generative models has led to the creation of incredibly realistic synthetic images that are becoming increasingly difficult to distinguish from real-world data. The widespread use of Variational Models, Diffusion Models, and Generative Adversarial Networks has made it easier to generate convincing fake images and videos, which poses significant challenges for detecting and mitigating the spread of misinformation. As a result, developing effective methods for detecting AI-generated fakes has become a pressing concern. In our research, we propose HFMF, a comprehensive two-stage deepfake detection framework that leverages both hierarchical cross-modal feature fusion and multi-stream feature extraction to enhance detection performance against imagery produced by state-of-the-art generative AI models. The first component of our approach integrates vision Transformers and convolutional nets through a hierarchical feature fusion mechanism. The second component of our framework combines object-level information and a fine-tuned convolutional net model. We then fuse the outputs from both components via an ensemble deep neural net, enabling robust classification performances. We demonstrate that our architecture achieves superior performance across diverse dataset benchmarks while maintaining calibration and interoperability.
中文: 深度生成模型已能制作高度逼真的合成图像,给检测带来挑战,而我们提出的HFMF框架通过分层跨模态融合和多流特征提取增强了伪造检测能力,在多种基准测试中实现了卓越性能。
English: Deep generative models have advanced to create highly realistic synthetic images, posing challenges for detection, but our proposed HFMF framework enhances fake detection through hierarchical cross-modal fusion and multi-stream feature extraction, achieving superior performance across benchmarks.

Authors:Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
Title: OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
Abstract:
Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
中文: 时间感知能力是在线视频大语言模型区别于离线模型的关键特征,新推出的OVO-Bench通过12项任务和三种场景对此进行评估,揭示了现有模型与人类表现之间的显著差距。
English: Temporal awareness, which enables dynamic reasoning based on question timestamps, is a critical feature distinguishing online video LLMs from offline models, and the newly introduced OVO-Bench evaluates this capability through 12 tasks across three scenarios to reveal current models' significant shortcomings compared to humans.

Authors:Mengshi Qi, Zhe Zhao, Huadong Ma
Title: Human Grasp Generation for Rigid and Deformable Objects with Decomposed VQ-VAE
Abstract:
Generating realistic human grasps is crucial yet challenging for object manipulation in computer graphics and robotics. Current methods often struggle to generate detailed and realistic grasps with full finger-object interaction, as they typically rely on encoding the entire hand and estimating both posture and position in a single step. Additionally, simulating object deformation during grasp generation is still difficult, as modeling such deformation requires capturing the comprehensive relationship among points of the object's surface. To address these limitations, we propose a novel improved Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE-2), which decomposes the hand into distinct parts and encodes them separately. This part-aware architecture allows for more precise management of hand-object interactions. Furthermore, we introduce a dual-stage decoding strategy that first predicts the grasp type under skeletal constraints and then identifies the optimal grasp position, enhancing both the realism and adaptability of the model to unseen interactions. In addition, we introduce a new Mesh UFormer as the backbone network to extract hierarchical structural representations from the mesh and propose a new normal vector-guided position encoding to simulate the hand-object deformation. In experiments, our model achieves a relative improvement of approximately 14.1% in grasp quality compared to state-of-the-art methods across four widely used benchmarks. Our comparisons with other backbone networks show relative improvements of 2.23% in Hand-object Contact Distance and 5.86% in Quality Index on deformable and rigid object based datasets, respectively. Our source code and model are available at https://github.com/florasion/D-VQVAE.
中文: 本文提出一种改进的分解向量量化变分自编码器,通过分别编码手部部件并采用双阶段解码策略及新型Mesh UFormer骨干网络,在多个基准测试中显著提升了抓取质量和真实感。
English: This paper introduces an improved Decomposed Vector-Quantized Variational Autoencoder that separately encodes hand parts and employs a dual-stage decoding strategy with a novel Mesh UFormer backbone, achieving significant improvements in grasp quality and realism across multiple benchmarks.

Authors:Jingyuan Tang, Yuhuan Zhao, Songlin Sun, Yangang Cai
Title: Implicit Guidance and Explicit Representation of Semantic Information in Points Cloud: A Survey
Abstract:
Point clouds, a prominent method of 3D representation, are extensively utilized across industries such as autonomous driving, surveying, electricity, architecture, and gaming, and have been rigorously investigated for their accuracy and resilience. The extraction of semantic information from scenes enhances both human understanding and machine perception. By integrating semantic information from two-dimensional scenes with three-dimensional point clouds, researchers aim to improve the precision and efficiency of various tasks. This paper provides a comprehensive review of the diverse applications and recent advancements in the integration of semantic information within point clouds. We explore the dual roles of semantic information in point clouds, encompassing both implicit guidance and explicit representation, across traditional and emerging tasks. Additionally, we offer a comparative analysis of publicly available datasets tailored to specific tasks and present notable observations. In conclusion, we discuss several challenges and potential issues that may arise in the future when fully utilizing semantic information in point clouds, providing our perspectives on these obstacles. We classify and organize articles related to semantic-based point cloud tasks and continuously follow up on relevant achievements in different fields; the collection can be accessed at https://github.com/Jasmine-tjy/Semantic-based-Point-Cloud-Tasks.
Chinese: 本文综述了将语义信息与三维点云融合以提高任务精度和效率的研究,探讨了其在传统和新兴应用中的双重作用,同时指出了未来挑战并提供了相关研究的整理资源。
English: This paper reviews the integration of semantic information with 3D point clouds to enhance task precision and efficiency, exploring its dual roles in traditional and emerging applications while addressing future challenges and providing a curated resource of related studies.

Authors:Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan
Title: Efficiently Serving Large Multimodal Models Using EPD Disaggregation
Abstract:
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90-100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.
Chinese: Encode-Prefill-Decode(EPD)解耦框架将多模态模型各处理阶段分配至专用资源,相比集成系统显著提升了内存效率、批处理规模和服务性能。
English: The Encode-Prefill-Decode (EPD) Disaggregation framework separates multimodal model stages onto dedicated resources, significantly improving memory efficiency, batch sizes, and service performance compared to integrated systems.

Authors:Yifan Yu, Shaohui Liu, Rémi Pautrat, Marc Pollefeys, Viktor Larsson
Title: Relative Pose Estimation through Affine Corrections of Monocular Depth Priors
Abstract:
Monocular depth estimation (MDE) models have undergone significant advancements over recent years. Many MDE models aim to predict affine-invariant relative depth from monocular images, while recent developments in large-scale training and vision foundation models enable reasonable estimation of metric (absolute) depth. However, effectively leveraging these predictions for geometric vision tasks, in particular relative pose estimation, remains relatively underexplored. While depths provide rich constraints for cross-view image alignment, the intrinsic noise and ambiguity from the monocular depth priors present practical challenges to improving upon classic keypoint-based solutions. In this paper, we develop three solvers for relative pose estimation that explicitly account for independent affine (scale and shift) ambiguities, covering both calibrated and uncalibrated conditions. We further propose a hybrid estimation pipeline that combines our proposed solvers with classic point-based solvers and epipolar constraints. We find that the affine correction modeling is beneficial to not only the relative depth priors but also, surprisingly, the "metric" ones. Results across multiple datasets demonstrate large improvements of our approach over classic keypoint-based baselines and PnP-based solutions, under both calibrated and uncalibrated setups. We also show that our method improves consistently with different feature matchers and MDE models, and can further benefit from very recent advances in both modules. Code is available at https://github.com/MarkYu98/madpose.
Chinese: 本文提出了新的求解器和混合流程,有效处理单目深度预测中的仿射模糊性问题,显著提升了相对姿态估计的精度,在多种数据集和设置下均优于传统的基于关键点的方法。
English: This paper introduces novel solvers and a hybrid pipeline that effectively address affine ambiguities in monocular depth predictions to significantly improve relative pose estimation, outperforming traditional keypoint-based methods across various datasets and setups.

Authors:Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin
Title: The GAN is dead; long live the GAN! A Modern GAN Baseline
Abstract:
There is a widely-spread claim that GANs are difficult to train, and GAN architectures in the literature are littered with empirical tricks. We provide evidence against this claim and build a modern GAN baseline in a more principled manner. First, we derive a well-behaved regularized relativistic GAN loss that addresses issues of mode dropping and non-convergence that were previously tackled via a bag of ad-hoc tricks. We analyze our loss mathematically and prove that it admits local convergence guarantees, unlike most existing relativistic losses. Second, our new loss allows us to discard all ad-hoc tricks and replace outdated backbones used in common GANs with modern architectures. Using StyleGAN2 as an example, we present a roadmap of simplification and modernization that results in a new minimalist baseline -- R3GAN. Despite being simple, our approach surpasses StyleGAN2 on FFHQ, ImageNet, CIFAR, and Stacked MNIST datasets, and compares favorably against state-of-the-art GANs and diffusion models.
中文: 本研究通过引入一种原理性的正则化相对损失,反驳了GAN难以训练的普遍观点,该损失无需经验性技巧并支持现代架构,由此产生的极简基线R3GAN在多个数据集上超越了StyleGAN2,并与顶尖模型相媲美。
English: The study refutes the common belief that GANs are hard to train by introducing a principled, regularized relativistic loss that eliminates the need for empirical tricks and enables the use of modern architectures, resulting in a minimalist baseline, R3GAN, which outperforms StyleGAN2 and competes with top models across multiple datasets.

Authors:Maximilian Dreyer, Jim Berend, Tobias Labarta, Johanna Vielhaben, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Title: Mechanistic understanding and validation of large AI models with SemanticLens
Abstract:
Unlike human-engineered systems such as aeroplanes, where each component's role and dependencies are well understood, the inner workings of AI models remain largely opaque, hindering verifiability and undermining trust. This paper introduces SemanticLens, a universal explanation method for neural networks that maps hidden knowledge encoded by components (e.g., individual neurons) into the semantically structured, multimodal space of a foundation model such as CLIP. In this space, unique operations become possible, including (i) textual search to identify neurons encoding specific concepts, (ii) systematic analysis and comparison of model representations, (iii) automated labelling of neurons and explanation of their functional roles, and (iv) audits to validate decision-making against requirements. Fully scalable and operating without human input, SemanticLens is shown to be effective for debugging and validation, summarizing model knowledge, aligning reasoning with expectations (e.g., adherence to the ABCDE-rule in melanoma classification), and detecting components tied to spurious correlations and their associated training data. By enabling component-level understanding and validation, the proposed approach helps bridge the "trust gap" between AI models and traditional engineered systems. We provide code for SemanticLens on https://github.com/jim-berend/semanticlens and a demo on https://semanticlens.hhi-research-insights.eu.
中文: 本文提出SemanticLens方法,通过将神经网络组件映射到语义空间,实现可扩展的解释功能,有助于调试、验证并增强对AI系统的信任。
English: This paper introduces SemanticLens, a scalable method that explains neural networks by mapping their components to a semantic space, enabling debugging, validation, and enhanced trust in AI systems.

Authors:Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou
Title: Search-o1: Agentic Search-Enhanced Large Reasoning Models
Abstract:
Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at https://github.com/sunnynexus/Search-o1.
中文: Search-o1通过引入代理检索增强生成机制和文档内推理模块,动态获取并精炼外部知识,有效提升了大型推理模型在复杂任务中的性能和可信度。
English: Search-o1 enhances large reasoning models by integrating an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module to dynamically retrieve and refine external knowledge, improving performance and reliability in complex reasoning tasks.

Authors:Wolfgang Gritz, Anett Hoppe, Ralph Ewerth
Title: Unraveling the Impact of Visual Complexity on Search as Learning
Abstract:
Information search has become essential for learning and knowledge acquisition, offering broad access to information and learning resources. The visual complexity of web pages is known to influence search behavior, with previous work suggesting that searchers make evaluative judgments within the first second on a page. However, there is a significant gap in our understanding of how visual complexity impacts searches specifically conducted with a learning intent. This gap is particularly relevant for the development of optimized information retrieval (IR) systems that effectively support educational objectives. To address this research need, we model visual complexity and aesthetics via a diverse set of features, investigating their relationship with search behavior during learning-oriented web sessions. Our study utilizes a publicly available dataset from a lab study where participants learned about thunderstorm formation. Our findings reveal that while content relevance is the most significant predictor for knowledge gain, sessions with less visually complex pages are associated with higher learning success. This observation applies to features associated with the layout of web pages rather than to simpler features (e.g., number of images). The reported results shed light on the impact of visual complexity on learning-oriented searches, informing the design of more effective IR systems for educational contexts. To foster reproducibility, we release our source code (https://github.com/TIBHannover/sal_visual_complexity).
中文摘要:网页的视觉复杂度对学习导向的搜索行为有显著影响,布局简洁的页面与更高的学习成效相关,这为优化教育信息检索系统提供了重要依据。
English Summary: Visual complexity of web pages significantly affects learning-oriented search behavior, with less complex layouts correlating with higher learning success, guiding the design of educational information retrieval systems.

Authors:Xinzi Cao, Xiawu Zheng, Guanhong Wang, Weijiang Yu, Yunhang Shen, Ke Li, Yutong Lu, Yonghong Tian
Title: Solving the Catastrophic Forgetting Problem in Generalized Category Discovery
Abstract:
Generalized Category Discovery (GCD) aims to identify a mix of known and novel categories within unlabeled data sets, providing a more realistic setting for image recognition. Essentially, GCD needs to remember existing patterns thoroughly to recognize novel categories. The recent state-of-the-art method SimGCD transfers the knowledge from known-class data to the learning of novel classes through debiased learning. However, some patterns are catastrophically forgotten during adaptation, leading to poor performance in novel category classification. To address this issue, we propose a novel learning approach, LegoGCD, which is seamlessly integrated into previous methods to enhance the discrimination of novel classes while maintaining performance on previously encountered known classes. Specifically, we design two techniques, termed Local Entropy Regularization (LER) and Dual-views Kullback Leibler divergence constraint (DKL). The LER optimizes the distribution of potential known class samples in unlabeled data, thus ensuring the preservation of knowledge related to known categories while learning novel classes. Meanwhile, DKL introduces Kullback Leibler divergence to encourage the model to produce similar prediction distributions for two views of the same image. In this way, it avoids mismatched predictions and simultaneously generates more reliable potential known class samples. Extensive experiments validate that the proposed LegoGCD effectively addresses the known category forgetting issue across all datasets, e.g., delivering a 7.74% and 2.51% accuracy boost on known and novel classes in CUB, respectively. Our code is available at: https://github.com/Cliffia123/LegoGCD.
中文: 提出的LegoGCD方法通过局部熵正则化和双视图KL散度约束,在广义类别发现中增强新类别的区分能力同时防止已知类别被遗忘,在多个基准数据集上实现了显著的准确率提升。
English: The proposed LegoGCD method enhances novel class discrimination in Generalized Category Discovery by integrating Local Entropy Regularization and Dual-views KL divergence to prevent catastrophic forgetting of known categories, achieving significant accuracy improvements on benchmark datasets.
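
The dual-view constraint described in the abstract above can be illustrated with a minimal sketch. It only shows the general idea of penalizing disagreement between the predicted class distributions of two augmented views of one image. The function names, the direction of the divergence, and the absence of any symmetrization or temperature are assumptions made for illustration; this is not the LegoGCD implementation.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def dual_view_kl(logits_view1, logits_view2):
    """Hypothetical helper: KL between the predictions of two views.

    Assumption: view 1 is treated as the reference distribution.
    """
    p = softmax(logits_view1)
    q = softmax(logits_view2)
    return kl_divergence(p, q)

# Two views that agree incur no penalty; disagreeing views incur a positive one.
agree = dual_view_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disagree = dual_view_kl([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

Added to a training loss, such a term would push the model toward consistent predictions across views, which is the behavior the abstract attributes to DKL.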

Authors:Fabian Hörst, Moritz Rempe, Helmut Becker, Lukas Heine, Julius Keyl, Jens Kleesiek
Title: CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models
Abstract:
Digital Pathology is a cornerstone in the diagnosis and treatment of diseases. A key task in this field is the identification and segmentation of cells in hematoxylin and eosin-stained images. Existing methods for cell segmentation often require extensive annotated datasets for training and are limited to a predefined cell classification scheme. To overcome these limitations, we propose CellViT++, a framework for generalized cell segmentation in digital pathology. CellViT++ utilizes Vision Transformers with foundation models as encoders to compute deep cell features and segmentation masks simultaneously. To adapt to unseen cell types, we rely on a computationally efficient approach. It requires minimal data for training and leads to a drastically reduced carbon footprint. We demonstrate excellent performance on seven different datasets, covering a broad spectrum of cell types, organs, and clinical settings. The framework achieves remarkable zero-shot segmentation and data-efficient cell-type classification. Furthermore, we show that CellViT++ can leverage immunofluorescence stainings to generate training datasets without the need for pathologist annotations. The automated dataset generation approach surpasses the performance of networks trained on manually labeled data, demonstrating its effectiveness in creating high-quality training datasets without expert annotations. To advance digital pathology, CellViT++ is available as an open-source framework featuring a user-friendly, web-based interface for visualization and annotation. The code is available under https://github.com/TIO-IKIM/CellViT-plus-plus.
中文: CellViT++框架采用基于基础模型的视觉变换器,实现了数字病理学中的通用细胞分割,在多种数据集上表现出色,仅需少量训练数据即可完成零样本分割。
English: The proposed CellViT++ framework utilizes Vision Transformers with foundation models to achieve generalized cell segmentation in digital pathology, demonstrating excellent performance across diverse datasets while requiring minimal training data and enabling zero-shot segmentation.

Authors:Xuyang Liu, Ziming Wang, Junjie Chen, Yuhang Han, Yingyao Wang, Jiale Yuan, Jun Song, Linfeng Zhang, Siteng Huang, Honggang Chen
Title: Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
Abstract:
Large vision-language models (LVLMs) excel at visual understanding, but face efficiency challenges due to quadratic complexity in processing long multi-modal contexts. While token compression can reduce computational costs, existing approaches are designed for single-view LVLMs and fail to consider the unique multi-view characteristics of high-resolution LVLMs with dynamic cropping. Existing methods treat all tokens uniformly, but our analysis reveals that global thumbnails can naturally guide the compression of local crops by providing holistic context for informativeness evaluation. In this paper, we first analyze the dynamic cropping strategy, revealing both the complementary nature of thumbnails and crops and the distinctive characteristics across different crops. Based on our observations, we propose "Global Compression Commander" (GlobalCom²), a novel plug-and-play token compression framework for HR-LVLMs. GlobalCom² leverages the thumbnail as the "commander" to guide the compression of local crops, adaptively preserving informative details while eliminating redundancy. Extensive experiments show that GlobalCom² maintains over 90% performance while compressing 90% of visual tokens, reducing FLOPs and peak memory to 9.1% and 60%, respectively. Our code is available at https://github.com/xuyang-liu16/GlobalCom2.
Chinese Summary: 本文提出GlobalCom²框架,通过全局缩略图指导高分辨率视觉语言模型中局部裁剪区域的动态压缩,在保留90%以上性能的同时将计算成本降低高达90%。
English Summary: The paper introduces GlobalCom², a token compression framework that uses global thumbnails to guide adaptive compression of local crops in high-resolution vision-language models, achieving over 90% performance retention while reducing computational costs by up to 90%.

Authors:Bhaskar Lalwani, Aniruddha Mukherjee
Title: KabaddiPy: A package to enable access to Professional Kabaddi Data
Abstract:
Kabaddi, a contact team sport of Indian origin, has seen a dramatic rise in global popularity, highlighted by the upcoming Kabaddi World Cup in 2025 with over sixteen international teams participating, alongside flourishing national leagues such as the Indian Pro Kabaddi League (230 million viewers) and the British Kabaddi League. We present the first open-source Python module to make Kabaddi statistical data easily accessible from multiple scattered sources across the internet. The module was developed by systematically web-scraping and collecting team-wise, player-wise and match-by-match data. The data has been cleaned, organized, and categorized into team overviews and player metrics, each filterable by season. The players are classified as raiders and defenders, with their best strategies for attacking, counter-attacking, and defending against different teams highlighted. Our module enables continuous monitoring of exponentially growing data streams, aiding researchers to quickly start building upon the data to answer critical questions, such as the impact of player inclusion/exclusion on team performance, scoring patterns against specific teams, and break down opponent gameplay. The data generated from Kabaddi tournaments has been sparsely used, and coaches and players rely heavily on intuition to make decisions and craft strategies. Our module can be utilized to build predictive models, craft uniquely strategic gameplays to target opponents and identify hidden correlations in the data. This open source module has the potential to increase time-efficiency, encourage analytical studies of Kabaddi gameplay and player dynamics and foster reproducible research. The data and code are publicly available: https://github.com/kabaddiPy/kabaddiPy
中文: 卡巴迪这项源自印度的团队运动正随着2025年世界杯等国际赛事和职业联赛的兴起而风靡全球,该开源Python模块首次实现了卡巴迪数据的系统化采集与分析,为战术研究和决策提供数据支持。
English: Kabaddi, a traditional Indian team sport, is gaining global traction with events like the 2025 World Cup and popular leagues, and this open-source Python module provides the first centralized tool for accessing and analyzing its statistical data to enhance strategic decisions and research.

Authors:Daniel Nezamabadi, Magnus Myreen
Title: Baking for Dafny: A CakeML Backend for Dafny
Abstract:
Dafny is a verification-aware programming language that allows developers to formally specify their programs and prove them correct. Currently, a Dafny program is compiled in two steps: First, a backend translates the input program to a high-level target language like C# or Rust. Second, the translated program is compiled using the target language's toolchain. Recently, an intermediate representation (IR) has been added to Dafny that serves as input to new backends. At the time of writing, none of these steps are verified, resulting in both the backend and the target language's toolchain being part of Dafny's trusted computing base (TCB). To reduce Dafny's TCB, we started developing a new backend that translates Dafny to CakeML, a verified, bootstrapped subset of Standard ML, in the interactive theorem prover HOL4. We also started to define functional big-step semantics for the Dafny IR to prove correctness of the backend.
中文: Dafny是一种具备验证意识的编程语言,正在开发新的后端以将其转换为CakeML,旨在通过为其中间表示定义函数式大步语义来证明正确性,从而减少其可信计算基。
English: Dafny is a verification-aware programming language that is developing a new backend to translate to CakeML, aiming to reduce its trusted computing base by proving correctness through functional big-step semantics for its intermediate representation.

Authors:Haoyi Xiu, Xin Liu, Taehoon Kim, Kyoung-Sook Kim
Title: Advancing ALS Applications with Large-Scale Pre-training: Dataset Development and Downstream Assessment
Abstract:
The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at https://github.com/martianxiu/ALS_pretraining.
中文: 本研究通过创新的地理空间采样方法构建了一个大规模机载激光扫描数据集,并证明在该数据集上预训练的模型在树种分类和语义分割等下游任务中性能显著提升。
English: This study develops a large-scale airborne laser scanning dataset using a novel geospatial sampling method and demonstrates that pre-training models on it significantly enhances performance in downstream tasks like tree species classification and semantic segmentation.

Authors:Hounsu Kim, Taegyun Kwon, Juhan Nam
Title: D3RM: A Discrete Denoising Diffusion Refinement Model for Piano Transcription
Abstract:
Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy that applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at https://github.com/hanshounsu/d3rm.
Chinese: 本文提出了一种新颖的离散扩散模型用于钢琴转录,该模型利用邻域注意力层及独特的训练与推理转换策略,在MAESTRO数据集上取得了最优的F1分数。
English: This paper introduces a novel discrete diffusion model for piano transcription that leverages Neighborhood Attention layers and distinct transition states during training and inference, outperforming previous diffusion-based transcription models and the baseline model in F1 score on the MAESTRO dataset.

Authors:Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen
Title: SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Abstract:
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source framework designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the code editing module uses a second model to generate patches for the identified files. To mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches and train the two models of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models with scores of 22.0% and 30.2%, respectively. Furthermore, SWE-Fixer reaches state-of-the-art performance (24.7% on Lite and 32.8% on Verified) with PASS_TO_PASS (P2P) filtering. Additionally, our approach requires only two model calls per instance, making it significantly more efficient than existing methods. These results highlight the effectiveness of SWE-Fixer in real-world code-fixing scenarios. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.
中文: SWE-Fixer是一个开源框架,通过文件检索和代码编辑双模块系统高效解决GitHub问题,在基准测试中取得优异性能且仅需少量模型调用。
English: SWE-Fixer is an open-source framework that efficiently resolves GitHub issues through a two-module system for file retrieval and code editing, achieving competitive performance on benchmarks while requiring minimal model calls.
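
The abstract names BM25 as the coarse stage of file retrieval. As a rough illustration only, the sketch below ranks candidate files against an issue text with a plain Okapi BM25 score. The function name `bm25_rank`, the whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, and the toy file contents are all assumptions made for this example; none of them are taken from the SWE-Fixer code base.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs, k1=1.5, b=0.75):
    """Rank documents by Okapi BM25 score (hypothetical helper, not SWE-Fixer's API).

    docs maps a file name to its list of tokens.
    Returns (names sorted by descending score, dict of scores).
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many files does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturated by k1 and normalized by document length via b.
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores[name] = score

    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Toy repository and issue text (illustrative data only).
files = {
    "auth/login.py": "def login user password check token".split(),
    "utils/math.py": "def add subtract multiply".split(),
    "README.md": "project docs install".split(),
}
issue = "login fails with invalid token".split()
ranked_files, file_scores = bm25_rank(issue, files)
```

In a coarse-to-fine setup like the one the abstract describes, a list such as `ranked_files` would be truncated to the top candidates and handed to a learned model for the fine-grained selection.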

Authors:Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
Title: ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
Abstract:
The enhancement of generalization in robots by large vision-language models (LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of LVLMs based on egocentric videos are of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks. Critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed. To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features a diverse range of scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench uses class-independent meticulous human annotation and multi-round question screening strategies. Additionally, we introduce ECEval, a comprehensive evaluation system that ensures the fairness and rationality of the indicators. Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents. All data and code are available at https://github.com/Rh-Dang/ECBench.
中文: ECBench是一个新颖的基准,旨在通过多样化的第一人称视角视频和30个认知维度,系统评估大型视觉语言模型的具身认知能力,弥补现有数据集的不足,并通过精细人工标注和综合评估体系确保公平性。
English: ECBench is a novel benchmark designed to systematically evaluate the embodied cognitive abilities of large vision-language models using diverse egocentric videos and 30 cognitive dimensions, addressing gaps in current datasets while ensuring fairness through meticulous human annotation and a comprehensive evaluation system.

Authors:Xiaojie Li, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang
Title: Continuous Knowledge-Preserving Decomposition with Adaptive Layer Selection for Few-Shot Class-Incremental Learning
Abstract:
Few-Shot Class-Incremental Learning (FSCIL) faces a critical challenge: balancing the retention of prior knowledge with the acquisition of new classes. Existing methods either freeze the backbone to prevent catastrophic forgetting, sacrificing plasticity, or add new modules, incurring high costs. These approaches treat pretrained models as black boxes, overlooking two key opportunities to exploit their internal capacity: reusing redundant representational space within layers and selectively adapting layers based on their sensitivity to forgetting. We propose CKPD-FSCIL, a unified framework that unlocks the underutilized capacity of pretrained weights, achieving a superior stability-plasticity balance with zero inference overhead. Our design integrates two continuously adapting mechanisms: At the weight level, a Continuous Knowledge-Preserving Decomposition mechanism uses feature covariance to split each weight matrix into a frozen subspace that safeguards prior knowledge and a learnable, redundant subspace for new tasks. At the layer level, a Continuous Adaptive Layer Selection mechanism leverages an Adapter Sensitivity Ratio to automatically select layers with the highest redundant capacity and lowest forgetting risk for adaptation. By targeting only safe, high-potential subspaces and layers, CKPD-FSCIL enables efficient adaptation. After each session, the learned adapters are merged back into the original weights, ensuring zero additional parameters or FLOPs during inference. Extensive experiments on multiple FSCIL benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in both adaptability and knowledge retention. The code is available at https://github.com/xiaojieli0903/CKPD-FSCIL.
中文: 本文提出的CKPD-FSCIL框架通过自适应复用预训练权重中的冗余子空间和选择性更新网络层,解决了小样本类增量学习中的稳定性与可塑性平衡难题,在零推理开销下实现了更优的性能。
English: The proposed CKPD-FSCIL framework addresses the stability-plasticity dilemma in Few-Shot Class-Incremental Learning by adaptively reusing redundant subspaces in pretrained weights and selectively updating layers, achieving superior performance without inference overhead.

Authors:Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimunić Rosing, Larry Heck
Title: SensorQA: A Question Answering Benchmark for Daily-Life Monitoring
Abstract:
With the rapid growth in sensor data, effectively interpreting and interfacing with these data in a human-understandable way has become crucial. While existing research primarily focuses on learning classification models, fewer studies have explored how end users can actively extract useful insights from sensor data, often hindered by the lack of a proper dataset. To address this gap, we introduce SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data for daily life monitoring. SensorQA is created by human workers and includes 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. We further establish benchmarks for state-of-the-art AI models on this dataset and evaluate their performance on typical edge devices. Our results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions. The dataset and code are available at: https://github.com/benjamin-reichman/SensorQA.
中文: 本文提出首个针对长期传感器数据的人工构建问答数据集SensorQA,包含5.6千条查询以解决人类洞察提取的空白,并揭示了当前AI模型在性能与效率方面的不足。
English: This paper introduces SensorQA, the first human-curated question-answering dataset for long-term sensor data, featuring 5.6K queries to bridge the gap in extracting human-centric insights and revealing current AI models' limitations in performance and efficiency.

Authors:HyunGi Kim, Siwon Kim, Jisoo Mok, Sungroh Yoon
Title: Battling the Non-stationarity in Time Series Forecasting via Test-time Adaptation
Abstract:
Deep Neural Networks have spearheaded remarkable advancements in time series forecasting (TSF), one of the major tasks in time series modeling. Nonetheless, the non-stationarity of time series undermines the reliability of pre-trained source time series forecasters in mission-critical deployment settings. In this study, we introduce a pioneering test-time adaptation framework tailored for TSF (TSF-TTA). TAFAS, the proposed approach to TSF-TTA, flexibly adapts source forecasters to continuously shifting test distributions while preserving the core semantic information learned during pre-training. The novel utilization of partially-observed ground truth and gated calibration module enables proactive, robust, and model-agnostic adaptation of source forecasters. Experiments on diverse benchmark datasets and cutting-edge architectures demonstrate the efficacy and generality of TAFAS, especially in long-term forecasting scenarios that suffer from significant distribution shifts. The code is available at https://github.com/kimanki/TAFAS.
中文: 深度神经网络推动了时间序列预测的显著进展,但非平稳性影响了其可靠性;为此提出的TAFAS测试时适应框架,能在保持预训练核心语义的同时,灵活调整预测器以适应持续变化的测试分布。
English: Deep Neural Networks have advanced time series forecasting, but their reliability is challenged by non-stationarity, leading to the development of TAFAS, a test-time adaptation framework that robustly adjusts forecasters to shifting distributions while preserving pre-trained knowledge.

Authors:Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Anna Choromanska
Title: AD-L-JEPA: Self-Supervised Spatial World Models with Joint Embedding Predictive Architecture for Autonomous Driving with LiDAR Data
Abstract:
As opposed to human drivers, current autonomous driving systems still require vast amounts of labeled data to train. Recently, world models have been proposed to simultaneously enhance autonomous driving capabilities by improving the way these systems understand complex real-world environments and reduce their data demands via self-supervised pre-training. In this paper, we present AD-L-JEPA (aka Autonomous Driving with LiDAR data via a Joint Embedding Predictive Architecture), a novel self-supervised pre-training framework for autonomous driving with LiDAR data that, as opposed to existing methods, is neither generative nor contrastive. Our method learns spatial world models with a joint embedding predictive architecture. Instead of explicitly generating masked unknown regions, our self-supervised world models predict Bird's Eye View (BEV) embeddings to represent the diverse nature of autonomous driving scenes. Our approach furthermore eliminates the need to manually create positive and negative pairs, as is the case in contrastive learning. AD-L-JEPA leads to simpler implementation and enhanced learned representations. We qualitatively and quantitatively demonstrate the high quality of the embeddings learned with AD-L-JEPA. We furthermore evaluate the accuracy and label efficiency of AD-L-JEPA on popular downstream tasks such as LiDAR 3D object detection and associated transfer learning. Our experimental evaluation demonstrates that AD-L-JEPA is a plausible approach for self-supervised pre-training in autonomous driving applications and is the best available approach, outperforming SOTA methods including the recently proposed Occupancy-MAE [1] and ALSO [2]. The source code of AD-L-JEPA is available at https://github.com/HaoranZhuExplorer/AD-L-JEPA-Release.
中文: 本文提出AD-L-JEPA这一新型自监督预训练框架,通过联合嵌入预测架构从激光雷达数据中学习空间世界模型,无需生成或对比方法,在精度和标签效率上超越了现有最优方法。
English: This paper introduces AD-L-JEPA, a novel self-supervised pre-training framework for autonomous driving that uses a joint embedding predictive architecture to learn spatial world models from LiDAR data, eliminating the need for generative or contrastive methods and outperforming state-of-the-art approaches in accuracy and label efficiency.

Authors:Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, Irwin King
Title: VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
Abstract:
With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs' knowledge understanding due to their inability to support end-to-end speech evaluation and account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs' knowledge understanding through pure speech interactions. Our benchmark 1) uniquely maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Systematic evaluation demonstrates that VoxEval presents significant challenges to current SLMs, revealing their sensitivity to varying audio conditions and highlighting the need to enhance reasoning capabilities in future development. We hope this benchmark could guide the advancement of more sophisticated and reliable SLMs. VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
中文: VoxEval是一种创新的语音问答基准,通过纯语音交互评估口语模型的知识理解能力,测试其在多样化音频条件和复杂推理任务中的鲁棒性。
English: VoxEval is a novel SpeechQA benchmark designed to evaluate spoken language models' knowledge understanding through pure speech interactions, testing their robustness across diverse audio conditions and complex reasoning tasks.

Authors:Lei Li, Xinglin Zhang, Jun Liang, Tao Chen
Title: Addressing Domain Shift via Imbalance-Aware Domain Adaptation in Embryo Development Assessment
Abstract:
Deep learning models in medical imaging face dual challenges: domain shift, where models perform poorly when deployed in settings different from their training environment, and class imbalance, where certain disease conditions are naturally underrepresented. We present Imbalance-Aware Domain Adaptation (IADA), a novel framework that simultaneously tackles both challenges through three key components: (1) adaptive feature learning with class-specific attention mechanisms, (2) balanced domain alignment with dynamic weighting, and (3) adaptive threshold optimization. Our theoretical analysis establishes convergence guarantees and complexity bounds. Through extensive experiments on embryo development assessment across four imaging modalities, IADA demonstrates significant improvements over existing methods, achieving up to 25.19% higher accuracy while maintaining balanced performance across classes. In challenging scenarios with low-quality imaging systems, IADA shows robust generalization with AUC improvements of up to 12.56%. These results demonstrate IADA's potential for developing reliable and equitable medical imaging systems for diverse clinical settings. The code is publicly available at https://github.com/yinghemedical/imbalance-aware_domain_adaptation.
中文: IADA框架通过自适应特征学习、平衡域对齐和阈值优化,有效解决了医学影像中的域偏移和类别不平衡问题,在多种临床场景中实现了高达25.19%的准确率提升和稳健的泛化性能。
English: The IADA framework effectively addresses domain shift and class imbalance in medical imaging through adaptive feature learning, balanced domain alignment, and threshold optimization, achieving up to 25.19% higher accuracy and robust generalization across diverse clinical settings.

Authors:Qingyu Ren, Jie Zeng, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Title: Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models
Abstract:
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. However, enhancing LLMs' ability to follow soft constraints remains an unexplored area. To bridge the gap, we first design a pipeline to automatically construct datasets with high-quality outputs. Additionally, to fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. Furthermore, taking into account the difficulty of soft constraints indicated by the number of constraints, we design a curriculum learning training paradigm based on the constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs' soft constraint following ability and analyze the factors driving the improvements. The datasets and code are publicly available at https://github.com/Rainier-rq/FollowSoftConstraint.
中文: 本研究提出了一种自动构建高质量数据集的新流程,并采用直接偏好优化与课程学习相结合的方法,有效提升了大语言模型遵循软约束的能力,实验验证了其显著改进效果。
English: This study introduces a novel pipeline for automatically generating datasets and employs Direct Preference Optimization with curriculum learning to enhance large language models' ability to follow soft constraints, demonstrating significant improvements through experimental validation.

Authors:Yapeng Li, Yong Luo, Lefei Zhang, Zengmao Wang, Bo Du
Title: MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification
Abstract:
Transformers have been extensively explored for hyperspectral image (HSI) classification. However, they pose challenges in terms of speed and memory usage because of their quadratic computational complexity. Recently, the Mamba model has emerged as a promising approach, which has strong long-distance modeling capabilities while maintaining a linear computational complexity. However, representing the HSI is challenging for Mamba due to the requirement for an integrated spatial and spectral understanding. To remedy these drawbacks, we propose a novel HSI classification model based on the Mamba model, named MambaHSI, which can simultaneously model long-range interaction of the whole image and integrate spatial and spectral information in an adaptive manner. Specifically, we design a spatial Mamba block (SpaMB) to model the long-range interaction of the whole image at the pixel level. Then, we propose a spectral Mamba block (SpeMB) to split the spectral vector into multiple groups, mine the relations across different spectral groups, and extract spectral features. Finally, we propose a spatial-spectral fusion module (SSFM) to adaptively integrate spatial and spectral features of an HSI. To the best of our knowledge, this is the first image-level HSI classification model based on Mamba. We conduct extensive experiments on four diverse HSI datasets. The results demonstrate the effectiveness and superiority of the proposed model for HSI classification. This reveals the great potential of Mamba to be the next-generation backbone for HSI models. Code is available at https://github.com/li-yapeng/MambaHSI.
Chinese: 提出的MambaHSI模型通过空间和光谱模块及融合机制,克服了Transformer和Mamba在高光谱图像分类中的不足,实现了卓越性能,展现出作为下一代骨干网络的巨大潜力。
English: The proposed MambaHSI model addresses the limitations of Transformers and Mamba in hyperspectral image classification by integrating spatial and spectral features through specialized blocks and a fusion module, demonstrating superior performance and potential as a next-generation backbone.

Authors:Guannan Lai, Yihui Feng, Xin Yang, Xiaoyu Deng, Hao Yu, Shuyin Xia, Guoyin Wang, Tianrui Li
Title: A New Perspective on Privacy Protection in Federated Learning with Granular-Ball Computing
Abstract:
Federated Learning (FL) facilitates collaborative model training while prioritizing privacy by avoiding direct data sharing. However, most existing articles attempt to address challenges within the model's internal parameters and corresponding outputs, while neglecting to solve them at the input level. To address this gap, we propose a novel framework called Granular-Ball Federated Learning (GrBFL) for image classification. GrBFL diverges from traditional methods that rely on the finest-grained input data. Instead, it segments images into multiple regions with optimal coarse granularity, which are then reconstructed into a graph structure. We designed a two-dimensional binary search segmentation algorithm based on variance constraints for GrBFL, which effectively removes redundant information while preserving key representative features. Extensive theoretical analysis and experiments demonstrate that GrBFL not only safeguards privacy and enhances efficiency but also maintains robust utility, consistently outperforming other state-of-the-art FL methods. The code is available at https://github.com/AIGNLAI/GrBFL.
中文: 本文提出了一种名为粒度球联邦学习(GrBFL)的新框架,通过粗粒度分割和图像重构为图结构,在保护隐私的同时提高了联邦学习的效率和性能,优于现有先进方法。
English: This paper introduces Granular-Ball Federated Learning (GrBFL), a novel framework that processes images through coarse-grained segmentation and graph reconstruction to enhance privacy, efficiency, and utility in federated learning, outperforming existing methods.

Authors:Sun-Hyuk Choi, Hayoung Jo, Seong-Whan Lee
Title: Multi-Context Temporal Consistent Modeling for Referring Video Object Segmentation
Abstract:
Referring video object segmentation aims to segment objects within a video corresponding to a given text description. Existing transformer-based temporal modeling approaches face challenges related to query inconsistency and the limited consideration of context. Query inconsistency produces unstable masks of different objects in the middle of the video. The limited consideration of context leads to the segmentation of incorrect objects by failing to adequately account for the relationship between the given text and instances. To address these issues, we propose the Multi-context Temporal Consistency Module (MTCM), which consists of an Aligner and a Multi-Context Enhancer (MCE). The Aligner removes noise from queries and aligns them to achieve query consistency. The MCE predicts text-relevant queries by considering multi-context. We applied MTCM to four different models, increasing performance across all of them, particularly achieving 47.6 J&F on the MeViS. Code is available at https://github.com/Choi58/MTCM.
Chinese: 提出的多上下文时序一致性模块(MTCM)通过对齐查询和增强多上下文理解,解决了指代视频对象分割中的查询不一致性和上下文考虑不足问题,在多个模型中提升了性能。
English: The proposed Multi-context Temporal Consistency Module (MTCM) addresses query inconsistency and limited contextual consideration in referring video object segmentation by aligning queries and enhancing multi-context understanding, improving performance across multiple models.

Authors:Zhenghui Zhao, Chen Wu, Lixiang Ru, Di Wang, Hongruixuan Chen, Cuiqun Chen
Title: Plug-and-Play DISep: Separating Dense Instances for Scene-to-Pixel Weakly-Supervised Change Detection in High-Resolution Remote Sensing Images
Abstract:
Existing Weakly-Supervised Change Detection (WSCD) methods often encounter the problem of "instance lumping" under scene-level supervision, particularly in scenarios with a dense distribution of changed instances (i.e., changed objects). In these scenarios, unchanged pixels between changed instances are also mistakenly identified as changed, causing multiple changes to be mistakenly viewed as one. In practical applications, this issue prevents the accurate quantification of the number of changes. To address this issue, we propose a Dense Instance Separation (DISep) method as a plug-and-play solution, refining pixel features from a unified instance perspective under scene-level supervision. Specifically, our DISep comprises a three-step iterative training process: 1) Instance Localization: We locate instance candidate regions for changed pixels using high-pass class activation maps. 2) Instance Retrieval: We identify and group these changed pixels into different instance IDs through connectivity searching. Then, based on the assigned instance IDs, we extract corresponding pixel-level features on a per-instance basis. 3) Instance Separation: We introduce a separation loss to enforce intra-instance pixel consistency in the embedding space, thereby ensuring separable instance feature representations. The proposed DISep adds only minimal training cost and no inference cost. It can be seamlessly integrated to enhance existing WSCD methods. We achieve state-of-the-art performance by enhancing three Transformer-based and four ConvNet-based methods on the LEVIR-CD, WHU-CD, DSIFN-CD, SYSU-CD, and CDD datasets. Additionally, our DISep can be used to improve fully-supervised change detection methods. Code is available at https://github.com/zhenghuizhao/Plug-and-Play-DISep-for-Change-Detection.
Chinese: 提出的密集实例分离(DISep)方法通过三步迭代训练过程从统一实例角度优化像素特征,有效解决了弱监督变化检测中的“实例堆积”问题,在多个数据集上以最小额外成本实现了最先进的性能。
English: The proposed Dense Instance Separation (DISep) method effectively addresses the "instance lumping" issue in Weakly-Supervised Change Detection by refining pixel features through a three-step iterative process, achieving state-of-the-art performance across multiple datasets with minimal additional cost.
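
The Instance Retrieval step in the abstract groups changed pixels into instance IDs "through connectivity searching". One common way to do this is connected-component labeling, sketched below with a breadth-first search over a binary change mask. The function name `label_instances`, the choice of 4-connectivity, and the toy mask are assumptions for illustration; the abstract does not specify the connectivity rule or the implementation DISep uses.

```python
from collections import deque


def label_instances(mask):
    """Assign an instance ID to each group of connected changed pixels.

    mask is a list of rows containing 0 (unchanged) or 1 (changed).
    Returns (labels, count): labels has the same shape as mask, with 0 for
    unchanged pixels and 1..count for instances. Uses 4-connectivity, which
    is an assumption of this sketch.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or labels[r][c]:
                continue  # unchanged pixel, or already assigned to an instance
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy change mask with two separate changed regions (illustrative data only).
change_mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
instance_labels, num_instances = label_instances(change_mask)
```

With per-pixel instance IDs like `instance_labels`, pixel features can be pooled per instance, which is the input a separation loss of the kind described in step 3 would need.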

Authors:Jake H. Lee, Michael Kiper, David R. Thompson, Philip G. Brodrick
Title: SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection
Abstract:
Current and upcoming generations of visible-shortwave infrared (VSWIR) imaging spectrometers promise unprecedented capacity to quantify Earth System processes across the globe. However, reliable cloud screening remains a fundamental challenge for these instruments, where traditional spatial and temporal approaches are limited by cloud variability and limited temporal coverage. The Spectroscopic Transformer (SpecTf) addresses these challenges with a spectroscopy-specific deep learning architecture that performs cloud detection using only spectral information (no spatial or temporal data are required). By treating spectral measurements as sequences rather than image channels, SpecTf learns fundamental physical relationships without relying on spatial context. Our experiments demonstrate that SpecTf significantly outperforms the current baseline approach implemented for the EMIT instrument, and performs comparably with other machine learning methods with orders of magnitude fewer learned parameters. Critically, we demonstrate SpecTf's inherent interpretability through its attention mechanism, revealing physically meaningful spectral features the model has learned. Finally, we present SpecTf's potential for cross-instrument generalization by applying it to a different instrument on a different platform without modifications, opening the door to instrument agnostic data driven algorithms for future imaging spectroscopy tasks.
中文: 光谱变换器(SpecTf)提出了一种专用于光谱学的深度学习架构,仅利用光谱信息进行云检测,不仅显著优于现有方法,还具备可解释性和跨仪器通用性,无需依赖空间或时间数据。
English: The Spectroscopic Transformer (SpecTf) introduces a spectroscopy-specific deep learning model that uses only spectral data for cloud detection, outperforming existing methods while offering interpretability and cross-instrument generalization without requiring spatial or temporal information.

Authors:Golriz Hosseinimanesh, Farnoosh Ghadiri, Francois Guibault, Farida Cheriet, Julia Keren
Title: From Mesh Completion to AI Designed Crown
Abstract:
Designing a dental crown is a time-consuming and labor-intensive process. Our goal is to simplify crown design and minimize the tediousness of making manual adjustments while still ensuring the highest level of accuracy and consistency. To this end, we present a new end-to-end deep learning approach, coined Dental Mesh Completion (DMC), to generate a crown mesh conditioned on a point cloud context. The dental context includes the tooth prepared to receive a crown and its surroundings, namely the two adjacent teeth and the three closest teeth in the opposing jaw. We formulate crown generation in terms of completing this point cloud context. A feature extractor first converts the input point cloud into a set of feature vectors that represent local regions in the point cloud. The set of feature vectors is then fed into a transformer to predict a new set of feature vectors for the missing region (crown). Subsequently, a point reconstruction head, followed by a multi-layer perceptron, is used to predict a dense set of points with normals. Finally, a differentiable point-to-mesh layer serves to reconstruct the crown surface mesh. We compare our DMC method to a graph-based convolutional neural network which learns to deform a crown mesh from a generic crown shape to the target geometry. Extensive experiments on our dataset demonstrate the effectiveness of our method, which attains an average Chamfer Distance of 0.062. The code is available at: https://github.com/Golriz-code/DMC.gi
中文: 本文提出DMC深度学习模型,通过点云数据自动生成牙冠网格,在保证精度的同时显著简化了传统牙冠设计的繁琐流程。
English: This paper introduces Dental Mesh Completion (DMC), an end-to-end deep learning method that automatically generates dental crown meshes from point cloud data to streamline design while maintaining high accuracy.
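
The abstract reports its result as a Chamfer Distance between point sets. For readers unfamiliar with the metric, the sketch below computes one widely used variant: the mean squared nearest-neighbor distance from each set to the other, summed over both directions. The abstract does not say which variant (squared or unsquared, summed or averaged) the authors use, so this choice, the function name `chamfer_distance`, and the toy points are assumptions for illustration.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Each set is a sequence of equal-length coordinate tuples. This variant
    averages squared nearest-neighbor distances in each direction and adds
    the two averages; other conventions exist.
    """

    def one_way(src, dst):
        # For every source point, take the squared distance to its nearest
        # destination point, then average over the source set.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Toy point sets (illustrative data only).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
identical_score = chamfer_distance(predicted, target)
shifted_score = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

This brute-force version is quadratic in the number of points; for dense crown point clouds a spatial index such as a k-d tree would normally replace the inner `min`.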

Authors:Yiyao Yang, Fu Teng, Pengju Liu, Mengnan Qi, Chenyang Lv, Ji Li, Xuhong Zhang, Zhezhi He
Title: HaVen: Hallucination-Mitigated LLM for Verilog Code Generation Aligned with HDL Engineers
Abstract:
Recently, the use of large language models (LLMs) for Verilog code generation has attracted great research interest to enable hardware design automation. However, previous works have shown a gap between the ability of LLMs and the practical demands of hardware description language (HDL) engineering. This gap includes differences in how engineers phrase questions and hallucinations in the code generated. To address these challenges, we introduce HaVen, a novel LLM framework designed to mitigate hallucinations and align Verilog code generation with the practices of HDL engineers. HaVen tackles hallucination issues by proposing a comprehensive taxonomy and employing a chain-of-thought (CoT) mechanism to translate symbolic modalities (e.g. truth tables, state diagrams, etc.) into accurate natural language descriptions. Furthermore, HaVen bridges this gap by using a data augmentation strategy. It synthesizes high-quality instruction-code pairs that match real HDL engineering practices. Our experiments demonstrate that HaVen significantly improves the correctness of Verilog code generation, outperforming state-of-the-art LLM-based Verilog generation methods on VerilogEval and RTLLM benchmark. HaVen is publicly available at https://github.com/Intelligent-Computing-Research-Group/HaVen.
中文: HaVen框架通过提出分类法和思维链机制来减少幻觉,并采用数据增强策略生成符合实际工程实践的指令-代码对,从而显著缩小大语言模型与硬件描述语言工程需求之间的差距。
English: The HaVen framework addresses the gap between large language models and practical hardware design needs by reducing hallucinations through a novel taxonomy and chain-of-thought mechanism, while enhancing code accuracy with data augmentation that reflects real engineering practices.

Authors:Seyed Amir Bidaki, Amir Mohammadkhah, Kiyan Rezaee, Faeze Hassani, Sadegh Eskandari, Maziar Salahi, Mohammad M. Ghassemi
Title: Online Continual Learning: A Systematic Literature Review of Approaches, Challenges, and Benchmarks
Abstract:
Online Continual Learning (OCL) is a critical area in machine learning, focusing on enabling models to adapt to evolving data streams in real-time while addressing challenges such as catastrophic forgetting and the stability-plasticity trade-off. This study conducts the first comprehensive Systematic Literature Review (SLR) on OCL, analyzing 81 approaches, extracting over 1,000 features (specific tasks addressed by these approaches), and identifying more than 500 components (sub-models within approaches, including algorithms and tools). We also review 83 datasets spanning applications like image classification, object detection, and multimodal vision-language tasks. Our findings highlight key challenges, including reducing computational overhead, developing domain-agnostic solutions, and improving scalability in resource-constrained environments. Furthermore, we identify promising directions for future research, such as leveraging self-supervised learning for multimodal and sequential data, designing adaptive memory mechanisms that integrate sparse retrieval and generative replay, and creating efficient frameworks for real-world applications with noisy or evolving task boundaries. By providing a rigorous and structured synthesis of the current state of OCL, this review offers a valuable resource for advancing this field and addressing its critical challenges and opportunities. The complete SLR methodology steps and extracted data are publicly available through the provided link: https://github.com/kiyan-rezaee/Systematic-Literature-Review-on-Online-Continual-Learning
中文: 本研究首次对在线持续学习领域开展系统性文献综述,分析了81种方法并识别出计算效率等关键挑战,同时提出了自监督学习和自适应记忆机制等未来研究方向。
English: This study presents the first comprehensive systematic literature review on Online Continual Learning, analyzing 81 approaches and identifying key challenges like computational efficiency and future directions including self-supervised learning and adaptive memory mechanisms.

Authors:Long Mai, Julie Carson-Berndsen
Title: Real-Time Textless Dialogue Generation
Abstract:
Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to the over-reliance on the traditional cascaded design, which involves separate, sequential components, as well as the use of text as an intermediate representation. This paper proposes a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, filters, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: https://github.com/mailong25/rts2s-dg
中文: 尽管大语言模型的进步提升了文本对话系统的表现,但语音对话系统因依赖级联设计和文本中间表示而缺乏自然性,为此本文提出一种实时无文本生成模型,通过直接处理语音流并融入副语言特征来实现更流畅自然的交互。
English: Recent advancements in large language models have improved text-based dialogue systems, but spoken dialogue systems still lack naturalness due to cascaded designs and text reliance, prompting the development of a real-time textless model that enables fluid interactions with paralinguistic signals.

Authors:Hafiz Mughees Ahmad, Dario Morle, Afshin Rahimi
Title: LayerMix: Enhanced Data Augmentation through Fractal Integration for Robust Deep Learning
Abstract:
Deep learning models have demonstrated remarkable performance across various computer vision tasks, yet their vulnerability to distribution shifts remains a critical challenge. Despite sophisticated neural network architectures, existing models often struggle to maintain consistent performance when confronted with Out-of-Distribution (OOD) samples, including natural corruptions, adversarial perturbations, and anomalous patterns. We introduce LayerMix, an innovative data augmentation approach that systematically enhances model robustness through structured fractal-based image synthesis. By meticulously integrating structural complexity into training datasets, our method generates semantically consistent synthetic samples that significantly improve neural network generalization capabilities. Unlike traditional augmentation techniques that rely on random transformations, LayerMix employs a structured mixing pipeline that preserves original image semantics while introducing controlled variability. Extensive experiments across multiple benchmark datasets, including CIFAR-10, CIFAR-100, ImageNet-200, and ImageNet-1K, demonstrate LayerMix's superior classification accuracy and substantial gains in critical Machine Learning (ML) safety metrics, including resilience to natural image corruptions, robustness against adversarial attacks, improved model calibration, and enhanced prediction consistency. LayerMix represents a significant advancement toward developing more reliable and adaptable artificial intelligence systems by addressing the fundamental challenges of deep learning generalization. The code is available at https://github.com/ahmadmughees/layermix.
中文: LayerMix是一种基于分形结构的创新数据增强方法,通过生成语义一致的合成图像系统提升深度学习模型的泛化能力,在多个基准测试中显著提高了分类准确率和机器学习安全指标。
English: LayerMix is a structured fractal-based data augmentation method that enhances deep learning model robustness by generating semantically consistent synthetic images, significantly improving generalization and safety metrics across multiple benchmarks.

Authors:Yachuan Li, Xavier Soria Poma, Yun Bai, Qian Xiao, Chaozhi Yang, Guanlin Li, Zongmin Li
Title: EDMB: Edge Detector with Mamba
Abstract:
Transformer-based models have made significant progress in edge detection, but their high computational cost is prohibitive. Recently, vision Mamba has shown excellent ability in efficiently capturing long-range dependencies. Drawing inspiration from this, we propose a novel edge detector with Mamba, termed EDMB, to efficiently generate high-quality multi-granularity edges. In EDMB, Mamba is combined with a global-local architecture, so it can focus on both global information and fine-grained cues. The fine-grained cues play a crucial role in edge detection, but are usually ignored by ordinary Mamba. We design a novel decoder to construct learnable Gaussian distributions by fusing global features and fine-grained features. The multi-granularity edges are then generated by sampling from the distributions. In order to make multi-granularity edges applicable to single-label data, we introduce an Evidence Lower Bound loss to supervise the learning of the distributions. On the multi-label dataset BSDS500, our proposed EDMB achieves competitive single-granularity ODS 0.837 and multi-granularity ODS 0.851 without multi-scale test or extra PASCAL-VOC data. Remarkably, EDMB can be extended to single-label datasets such as NYUDv2 and BIPED. The source code is available at https://github.com/Li-yachuan/EDMB.
中文: 提出的EDMB模型结合视觉Mamba与全局-局部架构,能高效生成高质量多粒度边缘,在BSDS500上表现优异,并可有效扩展至单标签数据集。
English: The proposed EDMB model leverages vision Mamba with a global-local architecture to efficiently produce high-quality multi-granularity edges, achieving competitive performance on BSDS500 and extending effectively to single-label datasets.

Authors:Lucas Prieto, Melih Barsbey, Pedro A. M. Mediano, Tolga Birdal
Title: Grokking at the Edge of Numerical Stability
Abstract:
Grokking, the sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon challenging our understanding of deep learning. Although significant progress has been made in understanding grokking, the reasons behind the delayed generalization and its dependence on regularization remain unclear. In this work, we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax function, which we refer to as Softmax Collapse (SC). We demonstrate that SC prevents grokking and that mitigating SC enables grokking without regularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call the naïve loss minimization (NLM) direction. This component of the gradient does not alter the model's predictions but decreases the loss by scaling the logits, typically by scaling the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking and eventually leads to SC, halting further learning. To validate our hypotheses, we introduce two key contributions that address the challenges in grokking tasks: StableMax, a new activation function that prevents SC and enables grokking without regularization, and $\perp$Grad, a training algorithm that promotes quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, elucidating its delayed generalization, reliance on regularization, and the effectiveness of existing grokking-inducing methods. Code for this paper is available at https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability.
中文: 本研究揭示了浮点误差导致的Softmax崩溃是阻碍深度学习模型"顿悟"现象的关键因素,并提出了StableMax激活函数和⊥Grad训练算法,可在无需正则化的情况下实现顿悟。
English: This study identifies Softmax Collapse caused by floating point errors as the key factor preventing grokking in deep learning models and introduces StableMax activation function and ⊥Grad training algorithm to enable grokking without regularization.
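
The floating-point effect described in this abstract can be illustrated with a small, self-contained sketch. It is not the authors' code and does not implement StableMax or ⊥Grad; the function names and the logit values are assumptions chosen for illustration. It uses Python's double-precision floats, where the same absorption effect appears at a larger logit gap than it would in float32.

```python
import math

def softmax(logits):
    # Standard max-shifted softmax; the shift avoids overflow but not absorption.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)  # a tiny term can be absorbed here: 1.0 + 1e-22 == 1.0
    return [e / total for e in exps]

def cross_entropy_and_grad(logits, target):
    # Cross-entropy loss and its gradient with respect to the logits.
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

# Moderate logit gap: loss and gradient are small but non-zero.
moderate_loss, moderate_grad = cross_entropy_and_grad([5.0, 0.0], target=0)

# Large logit gap: the sum of exponentials rounds to exactly 1.0, so the
# loss and the gradient on the target logit are exactly zero.
collapsed_loss, collapsed_grad = cross_entropy_and_grad([50.0, 0.0], target=0)

print(moderate_loss, moderate_grad[0])
print(collapsed_loss, collapsed_grad[0])
```

Once the logits are scaled far enough, the correctly classified sample contributes an exactly zero loss, which is consistent with the abstract's claim that logit scaling eventually halts learning.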

Authors:Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li
Title: EpiCoder: Encompassing Diversity and Complexity in Code Generation
Abstract:
Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, which captures and recognizes more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely-used base models to obtain EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in the synthesizing of repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.
Chinese: 本文提出了一种基于特征树的合成框架,通过从代码高级抽象中迭代精炼层次特征来增强代码生成,实现了对复杂度的精确控制,并在多个基准测试中达到了最先进的性能。
English: This paper introduces a feature tree-based synthesis framework that enhances code generation by iteratively refining hierarchical features from high-level abstractions, enabling precise control over complexity and achieving state-of-the-art performance across multiple benchmarks.

Authors:Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Ruihang Chu, Jin Zeng, Yujiu Yang
Title: URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Abstract:
Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found at https://github.com/URSA-MATH.
中文: 本研究提出URSA三阶段框架,通过构建高质量多模态推理数据集、开发自动化过程监督方法及创新强化学习算法,显著提升多模态数学推理能力,在多个基准测试中超越主流模型表现。
English: This study introduces URSA, a three-stage framework that enhances multimodal mathematical reasoning by developing a high-quality dataset, creating automated process supervision, and implementing a novel reinforcement learning method, achieving superior performance over leading models.

Authors:Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, Jin Zeng, Yujiu Yang
Title: Unlocking Multimodal Mathematical Reasoning via Process Reward Model
Abstract:
Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found at https://github.com/URSA-MATH.
中文: 本研究提出URSA三阶段框架,通过构建高质量多模态推理数据集、开发自动化过程监督方法及创新强化学习算法,显著提升多模态数学推理能力,在多个基准测试中超越主流模型表现。
English: This study introduces URSA, a three-stage framework that enhances multimodal mathematical reasoning by developing a high-quality dataset, creating automated process supervision, and implementing a novel reinforcement learning method, achieving superior performance over leading models.

Authors:Tarek Naous, Wei Xu
Title: On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Abstract:
Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2
中文:语言模型在非西方语言中表现出偏向西方实体的文化偏见,CAMeL-2基准测试显示,由于基于频率的分词和词汇歧义,模型在阿拉伯语中的表现差距更为明显。
English: Language models exhibit cultural biases favoring Western entities in non-Western languages, with CAMeL-2 benchmark revealing performance gaps in Arabic due to frequency-based tokenization and lexical ambiguities.

Authors:Eric Chen, Xi Chen, Arian Maleki, Shirin Jalali
Title: Comprehensive Examination of Unrolled Networks for Solving Linear Inverse Problems
Abstract:
Unrolled networks have become prevalent in various computer vision and imaging tasks. Although they have demonstrated remarkable efficacy in solving specific computer vision and computational imaging tasks, their adaptation to other applications presents considerable challenges. This is primarily due to the multitude of design decisions that practitioners working on new applications must navigate, each potentially affecting the network's overall performance. These decisions include selecting the optimization algorithm, defining the loss function, and determining the number of convolutional layers, among others. Compounding the issue, evaluating each design choice requires time-consuming simulations to train, fine-tune the neural network, and optimize for its performance. As a result, the process of exploring multiple options and identifying the optimal configuration becomes time-consuming and computationally demanding. The main objectives of this paper are (1) to unify some ideas and methodologies used in unrolled networks to reduce the number of design choices a user has to make, and (2) to report a comprehensive ablation study to discuss the impact of each of the choices involved in designing unrolled networks and present practical recommendations based on our findings. We anticipate that this study will help scientists and engineers design unrolled networks for their applications and diagnose problems within their networks efficiently.
Chinese: 本文旨在通过统一方法和提供全面的消融研究及实用建议,简化展开网络的设计过程,从而降低其在新应用中适应时的复杂性和计算需求。
English: This paper aims to simplify the design of unrolled networks by unifying methodologies and providing a comprehensive ablation study with practical recommendations to reduce the complexity and computational demands of adapting them to new applications.

Authors:Boyang Sun, Hanzhi Chen, Stefan Leutenegger, Cesar Cadena, Marc Pollefeys, Hermann Blum
Title: FrontierNet: Learning Visual Cues to Explore
Abstract:
Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for different tasks, such as mapping, object discovery, and environmental assessment. Existing solutions, such as frontier-based exploration approaches, rely heavily on 3D map operations, which are limited by map quality and, more critically, often overlook valuable context from visual cues. This work aims at leveraging 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose a visual-only frontier-based exploration system, with FrontierNet as its core component. FrontierNet is a learning-based model that (i) proposes frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent goal-extraction approaches, achieving a 15% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments. The project is available at https://github.com/cvg/FrontierNet.
中文摘要:本研究提出FrontierNet视觉探索系统,仅通过RGB图像和深度先验识别边界并预测信息增益,在早期探索阶段比依赖三维地图的方法效率提升15%。
English Summary: This research introduces FrontierNet, a visual-only exploration system that uses posed RGB images and depth priors to identify frontiers and predict their information gain, achieving 15% higher early-stage efficiency than 3D map-based methods.

Authors:Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu
Title: InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Abstract:
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.
中文: InfiGUIAgent是一种基于多模态大语言模型的图形界面代理,通过两阶段微调训练具备原生推理能力,在多个基准测试中表现出色,提升了自动化任务的交互效果。
English: InfiGUIAgent, an MLLM-based GUI agent trained with a two-stage fine-tuning pipeline, enhances GUI interaction through native reasoning skills and achieves competitive performance on benchmarks.

Authors:Qingmei Wang, Yuxin Wu, Yujie Long, Jing Huang, Fengyuan Ran, Bing Su, Hongteng Xu
Title: A Plug-and-Play Bregman ADMM Module for Inferring Event Branches in Temporal Point Processes
Abstract:
An event sequence generated by a temporal point process is often associated with a hidden and structured event branching process that captures the triggering relations between its historical and current events. In this study, we design a new plug-and-play module based on the Bregman ADMM (BADMM) algorithm, which infers event branches associated with event sequences in the maximum likelihood estimation framework of temporal point processes (TPPs). Specifically, we formulate the inference of event branches as an optimization problem for the event transition matrix under sparse and low-rank constraints, which is embedded in existing TPP models or their learning paradigms. We can implement this optimization problem based on subspace clustering and sparse group-lasso, respectively, and solve it using the Bregman ADMM algorithm, whose unrolling leads to the proposed BADMM module. When learning a classic TPP (e.g., Hawkes process) by the expectation-maximization algorithm, the BADMM module helps derive structured responsibility matrices in the E-step. Similarly, the BADMM module helps derive low-rank and sparse attention maps for the neural TPPs with self-attention layers. The structured responsibility matrices and attention maps, which work as learned event transition matrices, indicate event branches, e.g., inferring isolated events and those key events triggering many subsequent events. Experiments on both synthetic and real-world data show that plugging our BADMM module into existing TPP models and learning paradigms can improve model performance and provide us with interpretable structured event branches. The code is available at https://github.com/qingmeiwangdaily/BADMM_TPP.
Chinese: 本研究提出了一种即插即用的BADMM模块,通过稀疏和低秩约束优化事件转移矩阵,推断时间点过程中的结构化事件分支,从而提升模型性能并增强可解释性。
English: This study introduces a plug-and-play BADMM module that infers structured event branching in temporal point processes by optimizing event transition matrices under sparse and low-rank constraints, enhancing both model performance and interpretability.

Authors:Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang
Title: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Abstract:
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
中文: rStar-Math 通过蒙特卡洛树搜索实现"深度思考",结合创新的数据合成、过程奖励模型和自进化方法,使小型语言模型在数学推理上超越 OpenAI o1,在 MATH 和 AIME 基准测试中达到顶尖水平。
English: rStar-Math demonstrates that small language models can outperform OpenAI o1 in math reasoning by employing Monte Carlo Tree Search for deep thinking, enhanced through innovations in data synthesis, process reward modeling, and iterative self-evolution, achieving state-of-the-art results on benchmarks like MATH and AIME.

Authors:Zhi Jin, Yuwei Qiu, Kaihao Zhang, Hongdong Li, Wenhan Luo
Title: MB-TaylorFormer V2: Improved Multi-branch Linear Transformer Expanded by Taylor Formula for Image Restoration
Abstract:
Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to the global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax-attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate the Softmax-attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in a linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named the second version of Taylor formula expansion-based Transformer (for short MB-TaylorFormer V2) has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions with limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at https://github.com/FVL2020/MB-TaylorFormerV2.
中文: 提出的MB-TaylorFormer V2 Transformer变体采用泰勒展开近似Softmax注意力实现线性复杂度,结合多分支架构进行多尺度处理,以极低计算成本在多种图像复原任务中达到最优性能。
English: The proposed MB-TaylorFormer V2 Transformer variant uses Taylor expansion to approximate Softmax-attention with linear complexity and a multi-branch architecture for multi-scale processing, achieving state-of-the-art results in various image restoration tasks with minimal computational cost.
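
The linear-complexity claim in this abstract rests on a reordering of the attention computation, sketched below. This is a minimal illustration, not the MB-TaylorFormer V2 implementation: it covers only the first-order term exp(q·k) ≈ 1 + q·k, omits the remainder approximation and the multi-branch design, and all function names and input values are assumptions. It also assumes the approximate weights stay positive, which holds for the small inputs used here.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quadratic_taylor_attention(Q, K, V):
    # Builds every pairwise weight 1 + q.k explicitly: O(n^2) in sequence length.
    out = []
    for q in Q:
        weights = [1.0 + dot(q, k) for k in K]
        denom = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / denom
                    for c in range(len(V[0]))])
    return out

def linear_taylor_attention(Q, K, V):
    # Same result, but K and V are summarized once: O(n) in sequence length.
    n, d, dv = len(K), len(K[0]), len(V[0])
    v_sum = [sum(v[c] for v in V) for c in range(dv)]          # sum_j v_j
    k_sum = [sum(k[r] for k in K) for r in range(d)]           # sum_j k_j
    kv = [[sum(k[r] * v[c] for k, v in zip(K, V)) for c in range(dv)]
          for r in range(d)]                                   # sum_j k_j v_j^T
    out = []
    for q in Q:
        denom = n + dot(q, k_sum)
        out.append([(v_sum[c] + sum(q[r] * kv[r][c] for r in range(d))) / denom
                    for c in range(dv)])
    return out

# Small hypothetical inputs: 2 queries, 3 keys/values, feature size 2.
Q = [[0.1, 0.2], [0.3, -0.1]]
K = [[0.2, 0.1], [-0.1, 0.4], [0.3, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]

slow = quadratic_taylor_attention(Q, K, V)
fast = linear_taylor_attention(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(slow, fast) for a, b in zip(ra, rb))
print(max_diff)
```

Because the weight 1 + q·k is linear in q, the sums over keys and values can be computed once and reused for every query, which is what removes the quadratic cost.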

Authors:Xin Zhang, Xue Yang, Yuxuan Li, Jian Yang, Ming-Ming Cheng, Xiang Li
Title: RSAR: Restricted State Angle Resolver and Rotated SAR Benchmark
Abstract:
Rotated object detection has made significant progress in optical remote sensing. However, advancements in the Synthetic Aperture Radar (SAR) field lag behind, primarily due to the absence of a large-scale dataset. Annotating such a dataset is inefficient and costly. A promising solution is to employ a weakly supervised model (e.g., trained with available horizontal boxes only) to generate pseudo-rotated boxes for reference before manual calibration. Unfortunately, the existing weakly supervised models exhibit limited accuracy in predicting the object's angle. Previous works attempt to enhance angle prediction by using angle resolvers that decouple angles into cosine and sine encodings. In this work, we first reevaluate these resolvers from a unified perspective of dimension mapping and show that they share the same shortcoming: they overlook the unit circle constraint inherent in these encodings, which easily leads to prediction biases. To address this issue, we propose the Unit Cycle Resolver (UCR), which incorporates a unit circle constraint loss to improve angle prediction accuracy. Our approach can effectively improve the performance of existing state-of-the-art weakly supervised methods and even surpasses fully supervised models on existing optical benchmarks (i.e., DOTA-v1.0 dataset). With the aid of UCR, we further annotate and introduce RSAR, the largest multi-class rotated SAR object detection dataset to date. Extensive experiments on both RSAR and optical datasets demonstrate that our UCR enhances angle prediction accuracy. Our dataset and code can be found at: https://github.com/zhasion/RSAR.
中文: 本文提出单位圆解析器(UCR),通过引入单位圆约束损失来解决弱监督旋转目标检测中的角度预测偏差问题,不仅提升了预测精度,还建立了迄今最大的多类别旋转SAR数据集RSAR。
English: This paper introduces the Unit Cycle Resolver (UCR) to address angle prediction biases in weakly supervised rotated object detection by incorporating a unit circle constraint loss, which improves accuracy and enables the creation of the largest multi-class rotated SAR dataset, RSAR.
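
As a rough illustration of what a unit circle constraint on cosine and sine encodings could look like, the sketch below penalizes predicted pairs that do not satisfy cos² + sin² = 1. The abstract does not give the loss formula, so the squared-deviation form, the function names, and the sample values are all assumptions rather than the UCR definition.

```python
import math

def unit_circle_penalty(cos_pred, sin_pred):
    # Assumed form: squared deviation of the encoding from the unit circle.
    return (cos_pred ** 2 + sin_pred ** 2 - 1.0) ** 2

def decode_angle(cos_pred, sin_pred):
    # Recover the angle in radians from the predicted encoding.
    return math.atan2(sin_pred, cos_pred)

theta = 0.3
on_circle = unit_circle_penalty(math.cos(theta), math.sin(theta))
off_circle = unit_circle_penalty(0.5, 0.5)  # hypothetical unconstrained output
recovered = decode_angle(math.cos(theta), math.sin(theta))

print(on_circle, off_circle, recovered)
```

A valid encoding incurs essentially no penalty, while an unconstrained prediction such as (0.5, 0.5) is pushed back toward the circle when this term is added to the training loss.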

Authors:Paweł Batorski, Jannik Brinkmann, Paul Swoboda
Title: NSA: Neuro-symbolic ARC Challenge
Abstract:
The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in a short time. We pre-train the transformer with synthetically generated data. At test time, we generate additional task-specific training tasks and fine-tune our model. Our results surpass comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.
中文: 本文提出了一种神经符号方法,结合Transformer生成搜索建议和组合搜索,在抽象与推理语料库任务中取得显著成效,性能超越现有最佳方法27%。
English: This paper introduces a neuro-symbolic method that uses a transformer to propose search directions and combinatorial search to efficiently solve tasks in the Abstraction and Reasoning Corpus, achieving a 27% improvement over state-of-the-art results.

Authors:Falguni Roy, Yiduo Shen, Na Zhao, Xiaofeng Ding, Md. Omar Faruk
Title: A Closer Look on Gender Stereotypes in Movie Recommender Systems and Their Implications with Privacy
Abstract:
The movie recommender system typically leverages user feedback to provide personalized recommendations that align with user preferences and increase business revenue. This study investigates the impact of gender stereotypes on such systems through a specific attack scenario. In this scenario, an attacker determines users' gender, a private attribute, by exploiting gender stereotypes about movie preferences and analyzing users' feedback data, which is either publicly available or observed within the system. The study consists of two phases. In the first phase, a user study involving 630 participants identified gender stereotypes associated with movie genres, which often influence viewing choices. In the second phase, four inference algorithms were applied to detect gender stereotypes by combining the findings from the first phase with users' feedback data. Results showed that these algorithms performed more effectively than relying solely on feedback data for gender inference. Additionally, we quantified the extent of gender stereotypes to evaluate their broader impact on digital computational science. The latter part of the study utilized two major movie recommender datasets: MovieLens 1M and Yahoo!Movie. Detailed experimental information is available on our GitHub repository: https://github.com/fr-iit/GSMRS
中文摘要:本研究探讨了如何利用电影偏好中的性别刻板印象,通过推荐系统推断用户的私密性别信息,并证明结合刻板印象的算法比仅使用反馈数据的算法表现更优。
English Summary: This study explores how gender stereotypes in movie preferences can be exploited to infer users' private gender information through recommender systems, demonstrating that algorithms incorporating stereotypes outperform those using only feedback data.

Authors:Sofie Verhees, Chandrasekhar Venkataraman, Mariya Ptashnyk
Title: Mathematical Modelling of Mechanotransduction via RhoA Signalling Pathways
Abstract:
We derive and simulate a mathematical model for mechanotransduction related to the Rho GTPase signalling pathway. The model addresses the bidirectional coupling between signalling processes and cell mechanics. A numerical method based on bulk-surface finite elements is proposed for the approximation of the coupled system of nonlinear reaction-diffusion equations, defined inside the cell and on the cell membrane, and the equations of elasticity. Our simulation results illustrate novel emergent features such as the strong dependence of the dynamics on cell shape, a threshold-like response to changes in substrate stiffness, and the fact that coupling mechanics and signalling can lead to the robustness of cell deformation to larger changes in substrate stiffness, ensuring mechanical homeostasis in agreement with experiments.
中文摘要:本研究建立了Rho GTPase机械转导的数学模型,通过有限元方法模拟信号传导与细胞力学的耦合作用,揭示了细胞形状依赖性动态变化及刚度响应性机械稳态等新特征。
English Summary: This study develops a mathematical model of Rho GTPase mechanotransduction that couples signaling with cell mechanics, using a finite element method to simulate nonlinear interactions which reveal shape-dependent dynamics and stiffness-responsive mechanical homeostasis.

Authors:Yucheng Ruan, Daniel J. Tan, See Kiong Ng, Ling Huang, Mengling Feng
Title: Towards accurate and reliable ICU outcome prediction: a multimodal learning framework based on belief function theory using structured EHRs and free-text notes
Abstract:
Accurate Intensive Care Unit (ICU) outcome prediction is critical for improving patient treatment quality and ICU resource allocation. Existing research mainly focuses on structured data, e.g. demographics and vital signs, and lacks effective frameworks to integrate clinical notes from heterogeneous electronic health records (EHRs). This study aims to explore a multimodal framework based on belief function theory that can effectively fuse heterogeneous structured EHRs and free-text notes for accurate and reliable ICU outcome prediction. The fusion strategy accounts for prediction uncertainty within each modality and conflicts between multimodal data. Experiments on the MIMIC-III dataset show that our framework provides more accurate and reliable predictions than existing approaches. Specifically, it outperformed the best baseline by 1.05%/1.02% in BACC, 9.74%/6.04% in F1 score, 1.28%/0.9% in AUROC, and 6.21%/2.68% in AUPRC for predicting mortality and PLOS, respectively. Additionally, it improved the reliability of the predictions with a 26.8%/15.1% reduction in the Brier score and a 25.0%/13.3% reduction in negative log-likelihood. By effectively reducing false positives, the model can aid in better allocation of medical resources in the ICU. Furthermore, the proposed method is versatile and can be extended to analyzing multimodal EHRs for other clinical tasks. The code implementation is available at https://github.com/yuchengruan/evid_multimodal_ehr.
Chinese: 本研究提出了一种基于置信函数理论的多模态框架,有效融合结构化电子健康记录和临床笔记,提高了ICU结果预测的准确性和可靠性,在MIMIC-III数据集上表现优于现有方法。
English: This study introduces a multimodal framework based on belief function theory that integrates structured EHRs and clinical notes to enhance ICU outcome prediction accuracy and reliability, outperforming existing methods on the MIMIC-III dataset.

Authors:Feng Liu, Bao Deng, Rui Su, Lei Bai, Wanli Ouyang
Title: DispFormer: A Pretrained Transformer Incorporating Physical Constraints for Dispersion Curve Inversion
Abstract:
Surface wave dispersion curve inversion is crucial for estimating subsurface shear-wave velocity ($v_s$), yet traditional methods often face challenges related to computational cost, non-uniqueness, and sensitivity to initial models. While deep learning approaches show promise, many require large labeled datasets and struggle with real-world datasets, which often exhibit varying period ranges, missing values, and low signal-to-noise ratios. To address these limitations, this study introduces DispFormer, a transformer-based neural network for $v_s$ profile inversion from Rayleigh-wave phase and group dispersion curves. DispFormer processes dispersion data independently at each period, allowing it to handle varying lengths without requiring network modifications or strict alignment between training and testing datasets. A depth-aware training strategy is also introduced, incorporating physical constraints derived from the depth sensitivity of dispersion data. DispFormer is pre-trained on a global synthetic dataset and evaluated on two regional synthetic datasets using zero-shot and few-shot strategies. Results show that even without labeled data, the zero-shot DispFormer generates inversion profiles that outperform the interpolated reference model used as the pretraining target, providing a deployable initial model generator to assist traditional workflows. When partial labeled data are available, the few-shot trained DispFormer surpasses traditional global search methods. Real-world tests further confirm that DispFormer generalizes well to dispersion data with varying lengths and achieves lower data residuals than reference models. These findings underscore the potential of DispFormer as a foundation model for dispersion curve inversion and demonstrate the advantages of integrating physics-informed deep learning into geophysical applications.
中文: 本研究提出DispFormer这一基于Transformer的神经网络,能够有效反演面波频散曲线获取地下横波速度剖面,通过零样本和小样本学习在合成和实际数据中表现出优越的泛化能力,解决了传统方法在计算成本和数据适应性方面的局限。
English: This study introduces DispFormer, a transformer-based neural network that effectively inverts surface wave dispersion curves for subsurface shear-wave velocity profiles, overcoming limitations of traditional methods and demonstrating strong generalization with zero-shot and few-shot learning on synthetic and real-world datasets.

Authors:Michal Nohel, Constantin Ulrich, Jonathan Suprijadi, Tassilo Wald, Klaus Maier-Hein
Title: A Unified Framework for Foreground and Anonymization Area Segmentation in CT and MRI Data
Abstract:
This study presents an open-source toolkit to address critical challenges in preprocessing data for self-supervised learning (SSL) for 3D medical imaging, focusing on data privacy and computational efficiency. The toolkit comprises two main components: a segmentation network that delineates foreground regions to optimize data sampling and thus reduce training time, and a segmentation network that identifies anonymized regions, preventing erroneous supervision in reconstruction-based SSL methods. Experimental results demonstrate high robustness, with mean Dice scores exceeding 98.5 across all anonymization methods and surpassing 99.5 for foreground segmentation tasks, highlighting the efficacy of the toolkit in supporting SSL applications in 3D medical imaging for both CT and MRI images. The weights and code are available at https://github.com/MIC-DKFZ/Foreground-and-Anonymization-Area-Segmentation.
Chinese: 该开源工具包通过优化数据采样和匿名化处理,提升了3D医学影像的自监督学习效率,在CT和MRI应用中平均Dice分数超过98.5,表现出卓越的鲁棒性。
English: This open-source toolkit enhances self-supervised learning for 3D medical imaging by optimizing data sampling and handling anonymized regions, achieving mean Dice scores above 98.5 and robust performance in CT and MRI applications.

Authors:Xueqiang Ouyang, Jia Wei, Wenjie Huo, Xiaocong Wang, Rui Li, Jianlong Zhou
Title: DeFusion: An Effective Decoupling Fusion Network for Multi-Modal Pregnancy Prediction
Abstract:
Temporal embryo images and parental fertility table indicators are both valuable for pregnancy prediction in \textbf{in vitro fertilization embryo transfer} (IVF-ET). However, current machine learning models cannot make full use of the complementary information between the two modalities to improve pregnancy prediction performance. In this paper, we propose a Decoupling Fusion Network called DeFusion to effectively integrate the multi-modal information for IVF-ET pregnancy prediction. Specifically, we propose a decoupling fusion module that decouples the information from the different modalities into related and unrelated information, thereby achieving a more delicate fusion. We also fuse temporal embryo images with a spatial-temporal position encoding, and extract fertility table indicator information with a table transformer. To evaluate the effectiveness of our model, we use a new dataset including 4046 cases collected from Southern Medical University. The experiments show that our model outperforms state-of-the-art methods. Meanwhile, the performance on the eye disease prediction dataset reflects the model's good generalization. Our code is available at https://github.com/Ou-Young-1999/DFNet.
中文: 提出的DeFusion网络通过解耦融合和专门编码,有效整合胚胎时序图像与生育表格数据,在IVF-ET妊娠预测中实现优越性能并展现良好泛化能力。
English: The proposed DeFusion network effectively integrates temporal embryo images and fertility table data through decoupled fusion and specialized encoding, achieving superior IVF-ET pregnancy prediction performance with demonstrated generalization capability.

Authors:Clément Fuchs, Maxime Zanella, Christophe De Vleeschouwer
Title: Online Gaussian Test-Time Adaptation of Vision-Language Models
Abstract:
Online test-time adaptation (OTTA) of vision-language models (VLMs) has recently garnered increased attention to take advantage of data observed along a stream to improve future predictions. Unfortunately, existing methods rely on dataset-specific hyperparameters, significantly limiting their adaptability to unseen tasks. In response, we propose Online Gaussian Adaptation (OGA), a novel method that models the likelihoods of visual features using Gaussian distributions and incorporates zero-shot priors into an interpretable Maximum A Posteriori (MAP) estimation framework with fixed hyper-parameters across all datasets. We demonstrate that OGA outperforms state-of-the-art methods on most datasets and runs. Additionally, we show that combining OTTA with popular few-shot techniques (a practical yet overlooked setting in prior research) is highly beneficial. Furthermore, our experimental study reveals that common OTTA evaluation protocols, which average performance over at most three runs per dataset, are inadequate due to the substantial variability observed across runs for all OTTA methods. Therefore, we advocate for more rigorous evaluation practices, including increasing the number of runs and considering additional quantitative metrics, such as our proposed Expected Tail Accuracy (ETA), calculated as the average accuracy in the worst 10% of runs. We hope these contributions will encourage more rigorous and diverse evaluation practices in the OTTA community. Code is available at https://github.com/cfuchs2023/OGA .
中文: 在线视觉语言模型的测试时自适应方法通过提出的在线高斯自适应得到改进,该方法采用固定超参数并融入零样本先验,在多数数据集上优于现有技术,同时因运行间性能差异显著,强调了需要更严格评估实践的必要性。
English: Online test-time adaptation for vision-language models is advanced by the proposed Online Gaussian Adaptation method, which uses fixed hyperparameters and integrates zero-shot priors, outperforming existing techniques and highlighting the need for more rigorous evaluation practices due to significant performance variability across runs.
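The Expected Tail Accuracy (ETA) described in the abstract above, the average accuracy over the worst 10% of runs, can be illustrated with a minimal sketch. The function name, the rounding rule (ceiling, with at least one run kept), and the sample numbers are assumptions made for illustration; they are not taken from the OGA repository.

```python
import math


def expected_tail_accuracy(run_accuracies, tail_fraction=0.10):
    """Average accuracy over the worst `tail_fraction` of runs.

    Assumption: the tail size is rounded up and always contains at
    least one run; the abstract does not specify the rounding rule.
    """
    if not run_accuracies:
        raise ValueError("at least one run is required")
    tail_size = max(1, math.ceil(tail_fraction * len(run_accuracies)))
    worst_runs = sorted(run_accuracies)[:tail_size]  # lowest accuracies first
    return sum(worst_runs) / tail_size


# Hypothetical accuracies from 20 runs of one method on one dataset.
accuracies = [i / 20 for i in range(20)]
eta = expected_tail_accuracy(accuracies)  # mean of the two worst runs
```

Unlike a plain mean over runs, this statistic is driven entirely by the unfavourable runs, which is why it complements average accuracy when run-to-run variability is large.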

Authors:Qiang Sun, Sirui Li, Du Huynh, Mark Reynolds, Wei Liu
Title: TimelineKGQA: A Comprehensive Question-Answer Pair Generator for Temporal Knowledge Graphs
Abstract:
Question answering over temporal knowledge graphs (TKGs) is crucial for understanding evolving facts and relationships, yet its development is hindered by limited datasets and difficulties in generating custom QA pairs. We propose a novel categorization framework based on timeline-context relationships, along with \textbf{TimelineKGQA}, a universal temporal QA generator applicable to any TKGs. The code is available at: \url{https://github.com/PascalSun/TimelineKGQA} as an open source Python package.
中文摘要:本研究提出了一种新的分类框架和TimelineKGQA通用生成器,用于时序知识图谱问答,旨在解决数据集限制并改进自定义问答对的生成。
English Summary: This study introduces a new categorization framework and TimelineKGQA, a universal generator for temporal question answering over knowledge graphs, addressing dataset limitations and enhancing custom QA pair creation.

Authors:Dong-Hai Zhu, Yu-Jie Xiong, Jia-Chen Zhang, Xi-Jiong Xie, Chun-Ming Xia
Title: Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting
Abstract:
Chain-of-Thought (CoT) Prompting is a dominant paradigm in Large Language Models (LLMs) to enhance complex reasoning. It guides LLMs to present multi-step reasoning, rather than generating the final answer directly. However, CoT encounters difficulties when key information required for reasoning is implicit or missing. This occurs because CoT emphasizes the sequence of reasoning steps while overlooking the early extraction of essential information. We propose a pre-prompting method called Iterative Summarization Pre-Prompting (ISP^2) to refine LLM reasoning when key information is not explicitly provided. First, entities and their corresponding descriptions are extracted to form potential key information pairs. Next, we use a reliability rating to assess these pairs, then merge the two lowest-ranked pairs into a new entity description. This process is repeated until a unique key information pair is obtained. Finally, that pair, along with the original question, is fed into LLMs to produce the answer. Extensive experiments demonstrate a 7.1% improvement compared to existing methods. Unlike traditional prompting, ISP^2 adopts an inductive approach with pre-prompting, offering flexible integration into diverse reasoning frameworks. The code is available at https://github.com/zdhgreat/ISP-2.
中文: 针对思维链提示在关键信息缺失或隐含时推理困难的问题,提出的迭代摘要预提示方法通过逐步提取和精炼关键信息对来优化大语言模型的推理能力,相比现有方法性能提升了7.1%。
English: Chain-of-Thought prompting struggles with implicit or missing key information in reasoning, so the proposed Iterative Summarization Pre-Prompting (ISP²) method enhances LLM performance by iteratively extracting and refining essential information pairs before generating answers, achieving a 7.1% improvement over existing methods.
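The merge loop described in the abstract above (rate the pairs, merge the two lowest-ranked, repeat until one pair remains) can be sketched as follows. This is only an illustration of the control flow: `rate` and `merge` are hypothetical callables standing in for the LLM-based reliability rating and summarization steps, and none of the names come from the ISP-2 repository.

```python
def iterative_summarize(pairs, rate, merge):
    """Reduce (entity, description) pairs to a single pair.

    `rate(pair)` returns a reliability score (higher is more reliable) and
    `merge(a, b)` returns a new pair; both are placeholders for model calls.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one key information pair is required")
    while len(pairs) > 1:
        pairs.sort(key=rate)                      # least reliable first
        merged = merge(pairs[0], pairs[1])        # combine the two lowest-ranked
        pairs = [merged] + pairs[2:]
    return pairs[0]


# Toy stand-ins: description length as "reliability", concatenation as "merge".
toy_pairs = [("a", "xx"), ("b", "x"), ("c", "xxxx")]
final_pair = iterative_summarize(
    toy_pairs,
    rate=lambda p: len(p[1]),
    merge=lambda p, q: (p[0] + "+" + q[0], p[1] + q[1]),
)
```

In the method as described, the final pair would then be passed to the LLM together with the original question.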

Authors:Miao Rang, Zhenni Bi, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang
Title: Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
Abstract:
Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices has become a challenge for their widespread application. There are several efficient VLM efforts, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve distinctly outperforms other models on language benchmarks and achieves a state-of-the-art result of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model. Our code is available at https://github.com/rangmiao/Eve.
Chinese: 创新的高效视觉语言模型与弹性视觉专家(Eve)框架,在仅1.8B参数下实现了语言能力与多模态性能的平衡,在语言基准和视觉语言模型基准中均取得卓越成果。
English: The innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve) balances linguistic preservation and multimodal enhancement, achieving superior performance with only 1.8B parameters in both language and VLM benchmarks.

Authors:Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du
Title: LLM4SR: A Survey on Large Language Models for Scientific Research
Abstract:
In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: https://github.com/du-nlp-lab/LLM4SR
中文: 本文首次系统综述了大语言模型如何通过推动假设发现、实验规划、科学写作和同行评审等关键环节来变革科研流程,同时指出了当前挑战与未来研究方向。
English: This paper provides the first systematic survey on how Large Language Models are revolutionizing scientific research by analyzing their roles in hypothesis discovery, experiment planning, scientific writing, and peer review, while identifying challenges and future directions.

Authors:Hyogon Ryu, NaHyeon Park, Hyunjung Shim
Title: DGQ: Distribution-Aware Group Quantization for Text-to-Image Diffusion Models
Abstract:
Despite the widespread use of text-to-image diffusion models across various tasks, their computational and memory demands limit practical applications. To mitigate this issue, quantization of diffusion models has been explored. It reduces memory usage and computational costs by compressing weights and activations into lower-bit formats. However, existing methods often struggle to preserve both image quality and text-image alignment, particularly in lower-bit ($<$ 8 bits) quantization. In this paper, we analyze the challenges associated with quantizing text-to-image diffusion models from a distributional perspective. Our analysis reveals that activation outliers play a crucial role in determining image quality. Additionally, we identify distinctive patterns in cross-attention scores, which significantly affect text-image alignment. To address these challenges, we propose Distribution-aware Group Quantization (DGQ), a method that identifies and adaptively handles pixel-wise and channel-wise outliers to preserve image quality. Furthermore, DGQ applies prompt-specific logarithmic quantization scales to maintain text-image alignment. Our method demonstrates remarkable performance on datasets such as MS-COCO and PartiPrompts. We are the first to successfully achieve low-bit quantization of text-to-image diffusion models without requiring additional fine-tuning of weight quantization parameters. Code is available at https://github.com/ugonfor/DGQ.
中文: 本文提出的分布感知分组量化(DGQ)方法通过自适应处理激活异常值并采用提示特定的对数量化尺度,成功解决了文本到图像扩散模型在低位量化中保持图像质量和图文一致性的难题,无需额外微调即可实现高效压缩。
English: This paper introduces Distribution-aware Group Quantization (DGQ), a novel method that effectively addresses the challenges of low-bit quantization in text-to-image diffusion models by adaptively handling activation outliers and applying prompt-specific logarithmic scales to preserve both image quality and text-image alignment without additional fine-tuning.

Authors:Youran Zhou, Mohamed Reda Bouadjenek, Jonathan Wells, Sunil Aryal
Title: HI-PMK: A Data-Dependent Kernel for Incomplete Heterogeneous Data Representation
Abstract:
Handling incomplete and heterogeneous data remains a central challenge in real-world machine learning, where missing values may follow complex mechanisms (MCAR, MAR, MNAR) and features can be of mixed types (numerical and categorical). Existing methods often rely on imputation, which may introduce bias or privacy risks, or fail to jointly address data heterogeneity and structured missingness. We propose the \textbf{H}eterogeneous \textbf{I}ncomplete \textbf{P}robability \textbf{M}ass \textbf{K}ernel (\textbf{HI-PMK}), a novel data-dependent representation learning approach that eliminates the need for imputation. HI-PMK introduces two key innovations: (1) a probability mass-based dissimilarity measure that adapts to local data distributions across heterogeneous features (numerical, ordinal, nominal), and (2) a missingness-aware uncertainty strategy (MaxU) that conservatively handles all three missingness mechanisms by assigning maximal plausible dissimilarity to unobserved entries. Our approach is privacy-preserving, scalable, and readily applicable to downstream tasks such as classification and clustering. Extensive experiments on over 15 benchmark datasets demonstrate that HI-PMK consistently outperforms traditional imputation-based pipelines and kernel methods across a wide range of missing data settings. Code is available at: https://github.com/echoid/Incomplete-Heter-Kernel
中文: 提出的HI-PMK方法通过基于概率质量的差异度量和缺失感知策略,解决了不完整和异构数据的挑战,无需数据填补即可在不同缺失数据场景中超越现有方法。
English: The proposed HI-PMK method addresses incomplete and heterogeneous data challenges through a probability mass-based dissimilarity measure and missingness-aware strategy, eliminating imputation needs while outperforming existing methods across various missing data scenarios.

Authors:Hyungjin Chung, Dohun Lee, Zihui Wu, Byung-Hoon Kim, Katherine L. Bouman, Jong Chul Ye
Title: ContextMRI: Enhancing Compressed Sensing MRI through Metadata Conditioning
Abstract:
Compressed sensing MRI seeks to accelerate MRI acquisition processes by sampling fewer k-space measurements and then reconstructing the missing data algorithmically. The success of these approaches often relies on strong priors or learned statistical models. While recent diffusion model-based priors have shown great potential, previous methods typically ignore clinically available metadata (e.g. patient demographics, imaging parameters, slice-specific information). In practice, metadata contains meaningful cues about the anatomy and acquisition protocol, suggesting it could further constrain the reconstruction problem. In this work, we propose ContextMRI, a text-conditioned diffusion model for MRI that integrates granular metadata into the reconstruction process. We train a pixel-space diffusion model directly on minimally processed, complex-valued MRI images. During inference, metadata is converted into a structured text prompt and fed to the model via CLIP text embeddings. By conditioning the prior on metadata, we unlock more accurate reconstructions and show consistent gains across multiple datasets, acceleration factors, and undersampling patterns. Our experiments demonstrate that increasing the fidelity of metadata, ranging from slice location and contrast to patient age, sex, and pathology, systematically boosts reconstruction performance. This work highlights the untapped potential of leveraging clinical context for inverse problems and opens a new direction for metadata-driven MRI reconstruction.
中文: ContextMRI是一种新型扩散模型,通过将临床元数据转化为文本提示来增强MRI重建效果,在多种数据集和加速因子下均实现了更精准的图像恢复。
English: ContextMRI is a novel diffusion model that enhances MRI reconstruction by incorporating clinical metadata as text prompts, achieving superior accuracy across various datasets and acceleration factors.

Authors:Yuze Wang, Rong Xiao, Haifeng Li, Mariana Belgiu, Chao Tao
Title: Enhancing Scene Classification in Cloudy Image Scenarios: A Collaborative Transfer Method with Information Regulation Mechanism using Optical Cloud-Covered and SAR Remote Sensing Images
Abstract:
In remote sensing scene classification, leveraging transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or directly using synthetic aperture radar (SAR) data in the target domain. However, cloud removal requires substantial auxiliary data for support and pre-training, while directly using SAR disregards the unobstructed portions of optical data. This study presents a scene classification transfer method that synergistically combines multi-modality data, which aims to transfer the source domain model trained on cloud-free optical data to the target domain that includes both cloudy optical and SAR data at low cost. Specifically, the framework incorporates two parts: (1) the collaborative transfer strategy, based on knowledge distillation, enables the efficient prior knowledge transfer across heterogeneous data; (2) the information regulation mechanism (IRM) is proposed to address the modality imbalance issue during transfer. It employs auxiliary models to measure the contribution discrepancy of each modality, and automatically balances the information utilization of modalities during the target model learning process at the sample level. The transfer experiments were conducted on simulated and real cloud datasets, demonstrating the superior performance of the proposed method compared to other solutions in cloud-covered scenarios. We also verified the importance and limitations of IRM, and further discussed and visualized the modality imbalance problem during the model transfer. Codes are available at https://github.com/wangyuze-csu/ESCCS
中文摘要:本研究提出一种遥感场景分类的迁移学习方法,通过协同利用多模态数据,结合知识蒸馏策略和信息调节机制,有效解决云层覆盖下光学与SAR数据的模态不平衡问题,在云污染场景中展现出优越性能。
English Summary: This study introduces a transfer learning method for remote sensing scene classification that synergistically utilizes multi-modal data, combining cloudy optical and SAR imagery through knowledge distillation and an information regulation mechanism to address modality imbalance and enhance performance in cloud-covered scenarios.

Authors:Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
Title: MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
Abstract:
Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
Chinese: MM-Gen是一种可扩展的方法,通过为图像生成高质量的合成文本来提升视觉语言模型在专业任务上的性能,例如使Llava-1.5在空间推理和图表理解方面分别实现了29%和15%的显著提升。
English: MM-Gen is a scalable method that generates high-quality synthetic text for images to enhance vision-language models' performance on specialized tasks, achieving significant improvements such as 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5.

Authors:Kam Woh Ng, Jing Yang, Jia Wei Sii, Jiankang Deng, Chee Seng Chan, Yi-Zhe Song, Tao Xiang, Xiatian Zhu
Title: Chirpy3D: Creative Fine-grained 3D Object Fabrication via Part Sampling
Abstract:
We present Chirpy3D, a novel approach for fine-grained 3D object generation, tackling the challenging task of synthesizing creative 3D objects in a zero-shot setting, with access only to unposed 2D images of seen categories. Without structured supervision -- such as camera poses, 3D part annotations, or object-specific labels -- the model must infer plausible 3D structures, capture fine-grained details, and generalize to novel objects using only category-level labels from seen categories. To address this, Chirpy3D introduces a multi-view diffusion model that decomposes training objects into anchor parts in an unsupervised manner, representing the latent space of both seen and unseen parts as continuous distributions. This allows smooth interpolation and flexible recombination of parts to generate entirely new objects with species-specific details. A self-supervised feature consistency loss further ensures structural and semantic coherence. The result is the first system capable of generating entirely novel 3D objects with species-specific fine-grained details through flexible part sampling and composition. Our experiments demonstrate that Chirpy3D surpasses existing methods in generating creative 3D objects with higher quality and fine-grained details. Code will be released at https://github.com/kamwoh/chirpy3d.
中文:Chirpy3D是一种零样本三维物体生成方法,通过无监督部件分解和多视角扩散模型,能够生成具有精细细节的全新三维物体,在质量和创造性上超越了现有方法。
English: Chirpy3D is a zero-shot 3D object generation method that uses unsupervised part decomposition and a multi-view diffusion model to create novel 3D objects with fine-grained details, outperforming existing approaches in quality and creativity.

Authors:Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, Rui Yan
Title: More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
Abstract:
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce \textit{DrICL}, a novel optimization method that enhances model performance through \textit{Differentiated} and \textit{Reweighting} objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the \textit{Many-Shot ICL Benchmark} (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset at https://github.com/xiaoqzhwhu/DrICL, hoping to facilitate further research in many-shot ICL.
中文: 大语言模型在多样本上下文学习中因优化目标不理想和数据噪声导致性能下降,而提出的DrICL方法通过差异化学习和动态加权机制有效解决这些问题,在新构建的基准测试中显著提升了多样本场景下的任务表现。
English: Large language models face performance degradation in many-shot in-context learning due to suboptimal optimization objectives and data noise, which the proposed DrICL method addresses through differentiated learning and dynamic demonstration reweighting, achieving significant improvements across tasks as validated on a newly developed benchmark.

Authors:Yuqi Li, Xingyou Lin, Kai Zhang, Chuanguang Yang, Zhongliang Guo, Jianping Gou, Yanli Li
Title: FedKD-hybrid: Federated Hybrid Knowledge Distillation for Lithography Hotspot Detection
Abstract:
Federated Learning (FL) provides novel solutions for machine learning (ML)-based lithography hotspot detection (LHD) under distributed privacy-preserving settings. Currently, two research pipelines have been investigated to aggregate local models and achieve global consensus: parameter-based and non-parameter-based (the latter also known as knowledge distillation, KD). While these two kinds of methods show effectiveness in specific scenarios, we note they have not fully utilized and transferred the information learned, leaving the potential of FL-based LHD unexplored. Thus, we propose FedKD-hybrid in this study to mitigate the research gap. Specifically, FedKD-hybrid clients agree on several identical layers across all participants and a public dataset for achieving global consensus. During training, the trained local model will be evaluated on the public dataset, and the generated logits will be uploaded along with the identical layer parameters. The aggregated information is consequently used to update local models via the public dataset as a medium. We compare our proposed FedKD-hybrid with several state-of-the-art (SOTA) FL methods under ICCAD-2012 and FAB (real-world collected) datasets with different settings; the experimental results demonstrate the superior performance of the FedKD-hybrid algorithm. Our code is available at https://github.com/itsnotacie/NN-FedKD-hybrid
中文: 本研究提出FedKD-hybrid方法,通过结合参数与知识蒸馏策略,利用共享层和公共数据集提升联邦学习在光刻热点检测中的全局共识能力,实验证明其性能优于现有先进方法。
English: This study introduces FedKD-hybrid, a federated learning method that combines parameter and knowledge distillation approaches to enhance lithography hotspot detection by utilizing shared layers and a public dataset for improved global model consensus, outperforming existing methods in experiments.
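
The upload-and-aggregate step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the aggregation rule, so a uniform mean over clients is assumed here, and the names `average_vectors`, `aggregate_round`, `shared_params`, and `public_logits` are hypothetical.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized lists of floats."""
    if not vectors:
        raise ValueError("need at least one vector")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(length)]


def aggregate_round(client_uploads):
    """Combine one round of client uploads (assumed uniform averaging).

    Each upload is a dict with:
      "shared_params": flat list of the agreed identical-layer parameters
      "public_logits": one list of logits per public-dataset sample
    """
    shared = average_vectors([u["shared_params"] for u in client_uploads])
    n_samples = len(client_uploads[0]["public_logits"])
    logits = [
        average_vectors([u["public_logits"][s] for u in client_uploads])
        for s in range(n_samples)
    ]
    return {"shared_params": shared, "public_logits": logits}


uploads = [
    {"shared_params": [1.0, 3.0], "public_logits": [[0.0, 2.0]]},
    {"shared_params": [3.0, 5.0], "public_logits": [[2.0, 4.0]]},
]
consensus = aggregate_round(uploads)
```

In this sketch each client would then load the averaged shared-layer parameters and distill from the averaged logits on the public dataset; that local update is omitted.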

Authors:Rui Liu, Hongyu Yuan, Haizhou Li
Title: Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition
Abstract:
Unlike traditional Automatic Speech Recognition (ASR), Audio-Visual Speech Recognition (AVSR) takes audio and visual signals simultaneously to infer the transcription. Recent studies have shown that Large Language Models (LLMs) can be effectively used for Generative Error Correction (GER) in ASR by predicting the best transcription from ASR-generated N-best hypotheses. However, these LLMs lack the ability to simultaneously understand audio and visual signals, making the GER approach challenging to apply in AVSR. In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again". Specifically, we first use a powerful AVSR system to read the audio and visual signals to get the N-best hypotheses, and then use the Q-former-based Multimodal Synchronous Encoder to read the audio and visual information again and convert them into audio and video compression representations, respectively, that can be understood by the LLM. Afterward, the audio-visual compression representations and the N-best hypotheses together constitute a Cross-modal Prompt to guide the LLM in producing the best transcription. In addition, we propose a Multi-Level Consistency Constraint training criterion, covering the logits level, utterance level, and representation level, to improve the correction accuracy while enhancing the interpretability of audio and visual compression representations. The experimental results on the LRS3 dataset show that our method outperforms current mainstream AVSR systems. The proposed AVGER can reduce the Word Error Rate (WER) by 24% compared to them. Code and models can be found at: https://github.com/CircleRedRain/AVGER.
中文: 提出的AVGER方法通过多模态编码器将音频和视觉输入转化为大语言模型可理解的表征,并结合N最佳假设构成跨模态提示,在LRS3数据集上实现了24%的词错误率降低,提升了视听语音识别性能。
English: The proposed AVGER method enhances Audio-Visual Speech Recognition by using a multimodal encoder to convert audio and visual inputs into LLM-understandable representations, combined with N-best hypotheses in a cross-modal prompt, achieving a 24% reduction in Word Error Rate on the LRS3 dataset.
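
For readers unfamiliar with the metric reported above, Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows the standard definition only; it is not taken from the AVGER code, and the function name `word_error_rate` is an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
example_wer = word_error_rate("a b c d", "a x c d")
```

Note that the abstract does not say whether the 24% figure is a relative or an absolute reduction in WER.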

Authors:Satchel French, Faith Zhu, Amish Jain, Naimul Khan
Title: Temporal Feature Weaving for Neonatal Echocardiographic Viewpoint Video Classification
Abstract:
Automated viewpoint classification in echocardiograms can help under-resourced clinics and hospitals provide faster diagnosis and screening when expert technicians may not be available. We propose a novel approach to echocardiographic viewpoint classification. We show that treating viewpoint classification as video classification rather than image classification yields an advantage. We propose a CNN-GRU architecture with a novel temporal feature weaving method, which leverages both spatial and temporal information to yield a 4.33\% increase in accuracy over baseline image classification while using only four consecutive frames. The proposed approach incurs minimal computational overhead. Additionally, we publish the Neonatal Echocardiogram Dataset (NED), a professionally annotated dataset providing sixteen viewpoints and associated echocardiography videos to encourage future work and development in this field. Code available at: https://github.com/satchelfrench/NED
中文摘要:本研究提出了一种结合CNN-GRU架构与时序特征融合的新方法,通过视频分类实现超声心动图视角识别,准确率较基线提升4.33%,并发布了带专业标注的新生儿超声心动图数据集以推动相关研究。
English Summary: This study introduces a novel CNN-GRU architecture with temporal feature weaving for echocardiographic viewpoint classification, achieving a 4.33% accuracy improvement over baseline methods while publishing the annotated Neonatal Echocardiogram Dataset to advance the field.

Authors:Xiangrui Meng, Ying Tan
Title: A GPU Implementation of Multi-Guiding Spark Fireworks Algorithm for Efficient Black-Box Neural Network Optimization
Abstract:
Swarm intelligence optimization algorithms have gained significant attention due to their ability to solve complex optimization problems. However, the efficiency of optimization in large-scale problems limits the use of related methods. This paper presents a GPU-accelerated version of the Multi-Guiding Spark Fireworks Algorithm (MGFWA), which significantly improves the computational efficiency compared to its traditional CPU-based counterpart. We benchmark the GPU-MGFWA on several neural network black-box optimization problems and demonstrate its superior performance in terms of both speed and solution quality. By leveraging the parallel processing power of modern GPUs, the proposed GPU-MGFWA results in faster convergence and reduced computation time for large-scale optimization tasks. The proposed implementation offers a promising approach to accelerate swarm intelligence algorithms, making them more suitable for real-time applications and large-scale industrial problems. Source code is released at https://github.com/mxxxr/MGFWA.
中文: 本文提出了一种GPU加速的多引导火花烟花算法,相比传统CPU方法显著提高了大规模优化问题的计算效率和求解质量。
English: This paper introduces a GPU-accelerated Multi-Guiding Spark Fireworks Algorithm that significantly enhances computational efficiency and solution quality for large-scale optimization problems, outperforming traditional CPU-based methods.

Authors:Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
Title: PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Abstract:
Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
中文摘要:PPTAgent采用基于编辑的两阶段方法,通过分析参考幻灯片并迭代生成新幻灯片,在内容、设计和连贯性三个维度上均显著优于现有自动演示文稿生成方法,并通过PPTEval框架进行全面评估。
English Summary: PPTAgent is a two-stage, edit-based approach that enhances presentation generation by analyzing reference slides and iteratively creating new ones, significantly outperforming existing methods in content, design, and coherence as evaluated by the PPTEval framework.

Authors:Yuechen Zhang, Yaoyang Liu, Bin Xia, Bohao Peng, Zexin Yan, Eric Lo, Jiaya Jia
Title: Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers
Abstract:
We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while requiring minimal parameters added. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/
中文: Magic Mirror 是一种创新框架,通过双分支面部特征提取器、轻量级跨模态适配器和两阶段训练策略,生成具有动态运动且保持身份一致的高质量视频,在身份保持与运动多样性平衡方面优于现有方法。
English: Magic Mirror is a novel framework that generates high-quality, identity-preserved videos with dynamic motion by integrating a dual-branch facial feature extractor, a lightweight cross-modal adapter, and a two-stage training strategy, outperforming existing methods in balancing identity consistency and motion diversity.

Authors:Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
Title: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Abstract:
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
中文: LLaVA-Mini是一种高效的大型多模态模型,通过模态预融合将视觉令牌压缩至仅一个,在显著降低计算开销的同时,保持了在图像和视频基准测试中的优异性能。
English: LLaVA-Mini is an efficient large multimodal model that drastically reduces computational overhead by compressing vision tokens to just one through modality pre-fusion, while maintaining high performance across image and video benchmarks.

Authors:Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
Title: Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Abstract:
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
中文:Diffusion as Shader (DaS) 提出了一种以三维追踪视频作为控制输入的通用架构,能够实现相机控制和物体操纵等多种视频生成任务,同时以极少的训练量确保时间连贯性。
English: Diffusion as Shader (DaS) introduces a unified architecture using 3D tracking videos as control inputs, enabling versatile video generation tasks like camera control and object manipulation while ensuring temporal consistency with minimal training.

Authors:Yindu Su, Huike Zou, Lin Sun, Ting Zhang, Haiyang Yang, Liyu Chen, David Lo, Qingheng Zhang, Shuguang Han, Jufeng Chen
Title: TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification
Abstract:
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendation, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial deployment. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Further, it has been successfully deployed on the real-world e-commerce platform Xianyu, processing millions of product listings daily with frequently updated, large-scale attribute taxonomies. We release the code to facilitate reproducibility and future research at https://github.com/SuYindu/TACLR.
中文: TACLR是一种基于检索的产品属性值识别新方法,通过对比学习和自适应推理有效解决了隐式值和分布外值等难题,实现了在电商平台上的可扩展高效部署。
English: TACLR is a novel retrieval-based method for product attribute value identification that overcomes challenges like implicit and out-of-distribution values through contrastive learning and adaptive inference, enabling scalable and efficient deployment on e-commerce platforms.
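
The retrieval formulation described above (embed the product profile and the candidate values, then pick a value by similarity, abstaining below a threshold) can be sketched as follows. This is a simplified illustration under stated assumptions, not the TACLR implementation: the embeddings are plain lists supplied by the caller, the threshold is a fixed argument rather than the dynamic threshold the abstract mentions, and `cosine` and `retrieve_value` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_value(profile_embedding, candidates, threshold):
    """Return the most similar candidate value, or None if no candidate
    reaches the threshold (the attribute is treated as having no value).

    candidates maps each normalized value string to its embedding.
    """
    best_value, best_score = None, -1.0
    for value, embedding in candidates.items():
        score = cosine(profile_embedding, embedding)
        if score > best_score:
            best_value, best_score = value, score
    if best_value is None or best_score < threshold:
        return None
    return best_value


color_candidates = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
confident = retrieve_value([0.9, 0.1], color_candidates, threshold=0.5)
abstained = retrieve_value([1.0, 1.0], color_candidates, threshold=0.9)
```

Because the output is always one of the candidate strings, the result is normalized by construction, and returning `None` gives a way to abstain when nothing matches well.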

Authors:Zhe Li, Man-wai Mak, Mert Pilanci, Hung-yi Lee, Helen Meng
Title: Spectral-Aware Low-Rank Adaptation for Speaker Verification
Abstract:
Previous research has shown that the principal singular vectors of a pre-trained model's weight matrices capture critical knowledge. In contrast, those associated with small singular values may contain noise or less reliable information. As a result, the LoRA-based parameter-efficient fine-tuning (PEFT) approach, which does not constrain the use of the spectral space, may not be effective for tasks that demand high representation capacity. In this study, we enhance existing PEFT techniques by incorporating the spectral information of pre-trained weight matrices into the fine-tuning process. We investigate spectral adaptation strategies with a particular focus on the additive adjustment of top singular vectors. This is accomplished by applying singular value decomposition (SVD) to the pre-trained weight matrices and restricting the fine-tuning within the top spectral space. Extensive speaker verification experiments on VoxCeleb1 and CN-Celeb1 demonstrate enhanced tuning performance with the proposed approach. Code is released at https://github.com/lizhepolyu/SpectralFT.
中文摘要:本研究通过融入预训练模型的频谱信息并重点调整主要奇异向量,改进了参数高效微调方法,在说话人验证任务中显著提升了性能。
English Summary: This study improves parameter-efficient fine-tuning by incorporating spectral information from pre-trained models, specifically adjusting top singular vectors to enhance performance in speaker verification tasks.

Authors:Jiayao Gu, Liting Chen, Yihong Li
Title: Investigating the Impact of Data Selection Strategies on Language Model Performance
Abstract:
Data selection is critical for enhancing the performance of language models, particularly when aligning training datasets with a desired target distribution. This study explores the effects of different data selection methods and feature types on model performance. We evaluate whether selecting data subsets can influence downstream tasks, whether n-gram features improve alignment with target distributions, and whether embedding-based neural features provide complementary benefits. Through comparative experiments using baseline random selection methods and distribution-aligned approaches, we provide insights into the interplay between data selection strategies and model training efficacy. All code for this study can be found at https://github.com/jgu13/HIR-Hybrid-Importance-Resampling-for-Language-Models.
Chinese: 本研究通过对比实验探讨不同数据选择方法和特征类型如何影响语言模型性能,分析其与目标分布的匹配程度及对下游任务的作用。
English: This study investigates how various data selection methods and feature types impact language model performance, examining their alignment with target distributions and effects on downstream tasks through comparative experiments.

Authors:Eduarda Caldeira, Guray Ozgur, Tahar Chettaoui, Marija Ivanovska, Peter Peer, Fadi Boutros, Vitomir Struc, Naser Damer
Title: MADation: Face Morphing Attack Detection with Foundation Models
Abstract:
Despite the considerable performance improvements of face recognition algorithms in recent years, the same scientific advances responsible for this progress can also be used to create efficient ways to attack them, posing a threat to their secure deployment. Morphing attack detection (MAD) systems aim to detect a specific type of threat, morphing attacks, at an early stage, preventing them from being considered for verification in critical processes. Foundation models (FMs) learn from extensive amounts of unlabelled data, achieving remarkable zero-shot generalization to unseen domains. Although this generalization capacity might be weak when dealing with domain-specific downstream tasks such as MAD, FMs can easily adapt to these settings while retaining the built-in knowledge acquired during pre-training. In this work, we recognize the potential of FMs to perform well in the MAD task when properly adapted to its specificities. To this end, we adapt FM CLIP architectures with LoRA weights while simultaneously training a classification header. The proposed framework, MADation, surpasses our alternative FM and transformer-based frameworks and constitutes the first adaptation of FMs to the MAD task. MADation presents competitive results with current MAD solutions in the literature and even surpasses them in several evaluation scenarios. To encourage reproducibility and facilitate further research in MAD, we publicly release the implementation of MADation at https://github.com/gurayozgur/MADation
中文: 尽管人脸识别技术的进步带来了新的攻击风险,本研究提出的MADation框架通过LoRA适配基础模型,在篡改攻击检测任务中展现出卓越性能,在多个评估场景下达到甚至超越了现有解决方案的水平。
English: While face recognition advances also enable new attack methods, this work introduces MADation, a framework adapting foundation models with LoRA to effectively detect morphing attacks, achieving competitive and even superior performance in various scenarios.

Authors:Xinbin Yuan, Zhaohui Zheng, Yuxuan Li, Xialei Liu, Li Liu, Xiang Li, Qibin Hou, Ming-Ming Cheng
Title: Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection
Abstract:
Despite rapid development, remote sensing object detection still struggles with high-aspect-ratio objects. This paper shows that large strip convolutions are good feature representation learners for remote sensing object detection and can detect objects of various aspect ratios well. Based on large strip convolutions, we build a new network architecture called Strip R-CNN, which is simple, efficient, and powerful. Unlike recent remote sensing object detectors that leverage large-kernel convolutions with square shapes, our Strip R-CNN takes advantage of sequential orthogonal large strip convolutions in our backbone network StripNet to capture spatial information. In addition, we improve the localization capability of remote sensing object detectors by decoupling the detection heads and equipping the localization branch with strip convolutions in our strip head. Extensive experiments on several benchmarks, for example DOTA, FAIR1M, HRSC2016, and DIOR, show that our Strip R-CNN greatly improves on previous work. In particular, our 30M model achieves 82.75% mAP on DOTA-v1.0, setting a new state-of-the-art record. Code is available at https://github.com/YXB-NKU/Strip-R-CNN.
中文摘要:本文提出Strip R-CNN网络架构,采用大型条带卷积有效检测遥感图像中不同长宽比的物体,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces Strip R-CNN, a novel network architecture using large strip convolutions to effectively detect objects of various aspect ratios in remote sensing, achieving state-of-the-art performance on multiple benchmarks.
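
To make the phrase "sequential orthogonal large strip convolutions" concrete, the sketch below applies a horizontal 1 x k kernel followed by a vertical k x 1 kernel to a small 2D grid. It is a toy single-channel illustration with no padding, written in plain Python; it is not the StripNet layer, whose channel handling and other details are not given in the abstract, and all function names are hypothetical.

```python
def strip_conv_horizontal(image, kernel):
    """Slide a 1 x k kernel along each row (no padding)."""
    k = len(kernel)
    return [
        [sum(row[c + i] * kernel[i] for i in range(k))
         for c in range(len(row) - k + 1)]
        for row in image
    ]


def strip_conv_vertical(image, kernel):
    """Slide a k x 1 kernel down each column (no padding)."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [
        [sum(image[r + i][c] * kernel[i] for i in range(k))
         for c in range(cols)]
        for r in range(rows - k + 1)
    ]


def sequential_strip_conv(image, h_kernel, v_kernel):
    """Horizontal strip followed by the orthogonal vertical strip."""
    return strip_conv_vertical(strip_conv_horizontal(image, h_kernel), v_kernel)


grid = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
# With all-ones kernels the two strips together sum the full 3 x 3 window.
window_sum = sequential_strip_conv(grid, [1, 1, 1], [1, 1, 1])
```

A pair of strips of length k uses 2k weights, whereas a square k x k kernel uses k^2, which is one reason long strips are cheap to enlarge.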

Authors:Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer, Ismail Ben Ayed
Title: Realistic Test-Time Adaptation of Vision-Language Models
Abstract:
The zero-shot capabilities of Vision-Language Models (VLMs) have been widely leveraged to improve predictive performance. However, previous works on transductive or test-time adaptation (TTA) often make strong assumptions about the data distribution, such as the presence of all classes. Our work challenges these favorable deployment scenarios, and introduces a more realistic evaluation framework, including: (i) a variable number of effective classes for adaptation within a single batch, and (ii) non-i.i.d. batches of test samples in online adaptation settings. We provide comprehensive evaluations, comparisons, and ablation studies that demonstrate how current transductive or TTA methods for VLMs systematically compromise the models' initial zero-shot robustness across various realistic scenarios, favoring performance gains under advantageous assumptions about the test samples' distributions. Furthermore, we introduce StatA, a versatile method that could handle a wide range of deployment scenarios, including those with a variable number of effective classes at test time. Our approach incorporates a novel regularization term designed specifically for VLMs, which acts as a statistical anchor preserving the initial text-encoder knowledge, particularly in low-data regimes. Code available at https://github.com/MaxZanella/StatA.
中文: 本研究挑战了视觉语言模型中转导学习和测试时适应方法的理想假设,提出了包含可变类别数和非独立同分布测试批次的现实评估框架,同时引入StatA方法——通过新型正则化项在多样化部署场景中保持模型初始鲁棒性的通用解决方案。
English: This study challenges the favorable assumptions in transductive and test-time adaptation methods for Vision-Language Models by introducing a realistic evaluation framework with variable class numbers and non-i.i.d. test batches, while proposing StatA—a versatile method with a novel regularization term that preserves initial model robustness across diverse deployment scenarios.

Authors:Avishai Elmakies, Omri Abend, Yossi Adi
Title: Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
Abstract:
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.
中文: 本文提出了一种无监督语音分割方法,利用语音语言模型处理多种声学语义风格变化,在边界检测和分段纯度方面优于基线方法。
English: This paper presents an unsupervised speech segmentation method that leverages Speech Language Models to handle multiple acoustic-semantic style changes, outperforming baselines in boundary detection and segment purity.

Authors:Mengshi Qi, Hao Ye, Jiaxuan Peng, Huadong Ma
Title: Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression
Abstract:
Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes are often in rapid movement and the corresponding visual appearance variances are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving in sports, are usually divided into multiple sub-actions, each of which contains different durations. However, existing methods focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions resulting in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method through hierarchically pose-guided multi-stage contrastive regression. Firstly, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network is introduced to separate different sub-actions and obtain segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physics structural priors, to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly-annotated FineDiving-Pose Dataset to improve the current low-quality human pose labels. In experiments, the results on FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.
中文: 本文提出了一种新颖的分层姿态引导多阶段对比回归方法,通过多尺度特征编码和过程分割解决动作质量评估中细粒度姿态变化捕捉和时序连续性的难题,并在新数据集和现有数据集上验证了其有效性。
English: This paper proposes a novel hierarchically pose-guided multi-stage contrastive regression method for action quality assessment, addressing challenges in capturing fine-grained pose variations and temporal continuity through multi-scale feature encoding and procedure segmentation, validated on new and existing datasets.

Authors:Jiaxuan Li, Qing Xu, Xiangjian He, Ziyu Liu, Daokun Zhang, Ruili Wang, Rong Qu, Guoping Qiu
Title: CFFormer: Cross CNN-Transformer Channel Attention and Spatial Feature Fusion for Improved Segmentation of Heterogeneous Medical Images
Abstract:
Medical image segmentation plays an important role in computer-aided diagnosis. Existing methods mainly utilize spatial attention to highlight the region of interest. However, due to limitations of medical imaging devices, medical images exhibit significant heterogeneity, posing challenges for segmentation. Ultrasound images, for instance, often suffer from speckle noise, low resolution, and poor contrast between target tissues and background, which may lead to inaccurate boundary delineation. To address these challenges caused by heterogeneous image quality, we propose a hybrid CNN-Transformer model, called CFFormer, which leverages effective channel feature extraction to enhance the model's ability to accurately identify tissue regions by capturing rich contextual information. The proposed architecture contains two key components: the Cross Feature Channel Attention (CFCA) module and the X-Spatial Feature Fusion (XFF) module. The model incorporates dual encoders, with the CNN encoder focusing on capturing local features and the Transformer encoder modeling global features. The CFCA module filters and facilitates interactions between the channel features from the two encoders, while the XFF module effectively reduces the significant semantic information differences in spatial features, enabling a smooth and cohesive spatial feature fusion. We evaluate our model across eight datasets covering five modalities to test its generalization capability. Experimental results demonstrate that our model outperforms current state-of-the-art methods and maintains accurate tissue region segmentation across heterogeneous medical image datasets. The code is available at https://github.com/JiaxuanFelix/CFFormer.
中文: 提出的CFFormer模型融合CNN与Transformer架构,通过特殊模块增强通道和空间特征整合,在包含噪声和低对比度等挑战的多种医学影像数据集中实现了更优的分割精度。
English: The proposed CFFormer model combines CNN and Transformer architectures with specialized modules to enhance channel and spatial feature integration, achieving superior segmentation accuracy across diverse medical imaging datasets despite challenges like noise and low contrast.

Authors:Liyue Chen, Jiangyi Fang, Tengfei Liu, Fangyuan Gao, Leye Wang
Title: STContext: A Multifaceted Dataset for Developing Context-aware Spatio-temporal Crowd Mobility Prediction Models
Abstract:
In smart cities, context-aware spatio-temporal crowd flow prediction (STCFP) models leverage contextual features (e.g., weather) to identify unusual crowd mobility patterns and enhance prediction accuracy. However, the best practice for incorporating contextual features remains unclear due to inconsistent usage of contextual features in different papers. Developing a multifaceted dataset with rich types of contextual features and STCFP scenarios is crucial for establishing a principled context modeling paradigm. Existing open crowd flow datasets lack an adequate range of contextual features, which poses an urgent requirement to build a multifaceted dataset to fill these research gaps. To this end, we create STContext, a multifaceted dataset for developing context-aware STCFP models. Specifically, STContext provides nine spatio-temporal datasets across five STCFP scenarios and includes ten contextual features, including weather, air quality index, holidays, points of interest, road networks, etc. Besides, we propose a unified workflow for incorporating contextual features into deep STCFP methods, with steps including feature transformation, dependency modeling, representation fusion, and training strategies. Through extensive experiments, we have obtained several useful guidelines for effective context modeling and insights for future research. The STContext is open-sourced at https://github.com/Liyue-Chen/STContext.
中文: STContext数据集旨在解决现有开放人群流量数据中上下文特征不足的问题,提供了多场景数据和统一的工作流程,以提升预测准确性并为未来研究提供指导。
English: The STContext dataset is introduced to address the lack of diverse contextual features in existing open crowd flow datasets, providing data across multiple scenarios and a unified workflow to improve prediction accuracy and guide future research.

Authors:NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mousavian, Seungjun Nah, Sriharsha Niverty, David Page, Despoina Paschalidou, Zeeshan Patel, Lindsey Pavao, Morteza Ramezanali, Fitsum Reda, Xiaowei Ren, Vasanth Rao Naik Sabavat, Ed Schmerling, Stella Shi, Bartosz Stefaniak, Shitao Tang, Lyne Tchapmi, Przemek Tredak, Wei-Cheng Tseng, Jibin Varghese, Hao Wang, Haoxiang Wang, Heng Wang, Ting-Chun Wang, Fangyin Wei, Xinyue Wei, Jay Zhangjie Wu, Jiashu Xu, Wei Yang, Lin Yen-Chen, Xiaohui Zeng, Yu Zeng, Jing Zhang, Qinsheng Zhang, Yuxuan Zhang, Qingqing Zhao, Artur Zolkowski
Title: Cosmos World Foundation Model Platform for Physical AI
Abstract:
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make Cosmos open-source and our models open-weight with permissive licenses available via https://github.com/nvidia-cosmos/cosmos-predict1.
中文: 物理AI需要自身和世界的数字孪生,Cosmos平台通过可定制的世界基础模型及开源工具提供支持,以解决社会关键问题。
English: Physical AI requires digital twins of itself and the world, which the Cosmos platform provides through customizable world foundation models and open-source tools to address societal challenges.

Authors:Zetian Feng, Dong Ni, Yi Wang
Title: Salient Region Matching for Fully Automated MR-TRUS Registration
Abstract:
Prostate cancer is a leading cause of cancer-related mortality in men. The registration of magnetic resonance (MR) and transrectal ultrasound (TRUS) can provide guidance for the targeted biopsy of prostate cancer. In this study, we propose a salient region matching framework for fully automated MR-TRUS registration. The framework consists of prostate segmentation, rigid alignment and deformable registration. Prostate segmentation is performed using two segmentation networks on MR and TRUS respectively, and the predicted salient regions are used for the rigid alignment. The rigidly-aligned MR and TRUS images serve as initialization for the deformable registration. The deformable registration network has a dual-stream encoder with cross-modal spatial attention modules to facilitate multi-modality feature learning, and a salient region matching loss to consider both structure and intensity similarity within the prostate region. Experiments on a public MR-TRUS dataset demonstrate that our method achieves satisfactory registration results, outperforming several cutting-edge methods. The code is publicly available at https://github.com/mock1ngbrd/salient-region-matching.
中文: 本研究提出了一种用于前列腺癌活检中磁共振与经直肠超声图像全自动配准的框架,通过分割、刚性对齐和带跨模态注意力的双流形变网络,在公共数据集上取得了优于现有方法的配准效果。
English: This study introduces a fully automated framework for MR-TRUS registration in prostate cancer biopsy, utilizing segmentation, rigid alignment, and a dual-stream deformable network with cross-modal attention to achieve superior results over existing methods.

Authors:Fatemeh Ghofrani, Pooyan Jamshidi
Title: An Empirical Study of Accuracy-Robustness Tradeoff and Training Efficiency in Self-Supervised Learning
Abstract:
Self-supervised learning (SSL) has significantly advanced image representation learning, yet efficiency challenges persist, particularly with adversarial training. Many SSL methods require extensive epochs to achieve convergence, a demand further amplified in adversarial settings. To address this inefficiency, we revisit the robust EMP-SSL framework, emphasizing the importance of increasing the number of crops per image to accelerate learning. Unlike traditional contrastive learning, robust EMP-SSL leverages multi-crop sampling, integrates an invariance term and regularization, and reduces training epochs, enhancing time efficiency. Evaluated with both standard linear classifiers and multi-patch embedding aggregation, robust EMP-SSL provides new insights into SSL evaluation strategies. Our results show that robust crop-based EMP-SSL not only accelerates convergence but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming multi-crop embedding aggregation. Additionally, we extend this approach with free adversarial training in Multi-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the effectiveness of free adversarial training in reducing training time while simultaneously improving clean accuracy and adversarial robustness. These findings underscore the potential of CF-AMC-SSL for practical SSL applications. Our code is publicly available at https://github.com/softsys4ai/CF-AMC-SSL.
中文摘要:鲁棒EMP-SSL框架通过增加图像裁剪数量和减少训练周期来加速自监督学习,而扩展的CF-AMC-SSL方法通过免费对抗训练进一步提升了效率与鲁棒性。
English Summary: The robust EMP-SSL framework accelerates self-supervised learning by increasing image crops and reducing training epochs, while the extended CF-AMC-SSL method further enhances efficiency and robustness through free adversarial training.

Authors:Nandan Kumar Jha, Brandon Reagen
Title: Entropy-Guided Attention for Private LLMs
Abstract:
The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: entropy collapse in deeper layers that destabilizes training, and entropic overload in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm
中文摘要:本研究提出了一种信息论框架,通过分析非线性在Transformer模型中的双重作用来解决私有推理中的通信和延迟挑战,并引入熵引导机制在保持模型稳定性和注意力多样性的同时优化架构设计。
English Summary: This study introduces an information-theoretic framework to address the communication and latency challenges in private inference by analyzing nonlinearities' dual role in transformer models, proposing entropy-guided mechanisms to optimize architectures while maintaining model stability and attention diversity.
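The abstract above uses Shannon's entropy of attention as its diagnostic. A minimal sketch of that quantity for a single attention row follows. The function names and the toy scores are illustrative assumptions, and reading "entropy collapse" as near-zero entropy and "entropic overload" as near-maximal entropy is an interpretation of the abstract, not a definition taken from it.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical score rows over four keys.
uniform_row = softmax([0.0, 0.0, 0.0, 0.0])   # attention spread evenly
peaked_row = softmax([100.0, 0.0, 0.0, 0.0])  # attention on a single key

print(shannon_entropy(uniform_row), shannon_entropy(peaked_row))
```

For a row over n keys the entropy lies between 0 (all mass on one key) and log n (uniform attention), so tracking it per head and per layer gives a bounded, comparable signal.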

Authors:Chuang Niu, Wenjun Xia, Hongming Shan, Ge Wang
Title: Information-Maximized Soft Variable Discretization for Self-Supervised Image Representation Learning
Abstract:
Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today's vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a joint-cross entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at https://github.com/niuchuangnn/IMSVD.
中文: 本研究提出信息最大化软变量离散化(IMSVD)这一新型自监督学习方法,通过对潜在变量进行软离散化来减少冗余并增强图像表征的可解释性,在无需标注的情况下于多项任务中展现出卓越性能。
English: This study introduces Information-Maximized Soft Variable Discretization (IMSVD), a novel self-supervised learning method that discretizes latent variables to minimize redundancy and enhance explainability in image representation, demonstrating superior performance across various tasks without requiring annotations.

Authors:Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky
Title: MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Abstract:
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
中文: MTRAG是一个人工构建的多轮RAG评测基准,揭示了当前最先进的大语言模型在处理跨领域复杂对话场景时面临的挑战。
English: MTRAG is a human-generated multi-turn RAG benchmark that reveals the challenges state-of-the-art LLMs face in handling complex conversational contexts across diverse domains.

Authors:Xiao Wang, Fuling Wang, Haowen Wang, Bo Jiang, Chuanfu Li, Yaowei Wang, Yonghong Tian, Jin Tang
Title: Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation
Abstract:
X-ray image-based medical report generation has achieved significant progress in recent years with the help of large language models. However, these models have not fully exploited the effective information in visual image regions, resulting in reports that are linguistically sound but insufficient in describing key diseases. In this paper, we propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports. It considers both the mining of global and local visual information and associates historical report information to better complete the writing of the current report. Specifically, given an X-ray image, we first utilize a classification model along with its activation maps to accomplish the mining of visual regions highly associated with diseases and the learning of disease query tokens. Then, we employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information. This process facilitates the generation of high-quality reports based on a large language model and achieves state-of-the-art performance on multiple benchmark datasets, including the IU X-ray, MIMIC-CXR, and Chexpert Plus. The source code of this work is released at https://github.com/Event-AHU/Medical_Image_Analysis.
中文摘要:本文提出了一种联想记忆增强的X光报告生成模型,通过结合全局与局部视觉信息并关联历史报告数据来模拟医生诊断过程,在多个医学影像基准测试中实现了最优性能。
English Summary: This paper introduces an associative memory-enhanced X-ray report generation model that mimics doctors' diagnostic processes by integrating global and local visual information with historical report data, achieving state-of-the-art performance on multiple medical imaging benchmarks.

Authors:Xuyang Wang, Ziang Cheng, Zhenyu Li, Jiayu Yang, Haorui Ji, Pan Ji, Mehrtash Harandi, Richard Hartley, Hongdong Li
Title: DoubleDiffusion: Combining Heat Diffusion with Denoising Diffusion for Texture Generation on 3D Meshes
Abstract:
This paper addresses the problem of generating textures for 3D mesh assets. Existing approaches often rely on image diffusion models to generate multi-view image observations, which are then transformed onto the mesh surface to produce a single texture. However, due to the gap between multi-view images and 3D space, such a process is susceptible to a range of issues such as geometric inconsistencies, visibility occlusion, and baking artifacts. To overcome this problem, we propose a novel approach that directly generates texture on 3D meshes. Our approach leverages heat dissipation diffusion, which serves as an efficient operator that propagates features on the geometric surface of a mesh, while remaining insensitive to the specific layout of the wireframe. By integrating this technique into a generative diffusion pipeline, we significantly improve the efficiency of texture generation compared to existing texture generation methods. We term our approach DoubleDiffusion, as it combines heat dissipation diffusion with denoising diffusion to enable native generative learning on 3D mesh surfaces.
中文: 本文提出DoubleDiffusion方法,通过结合热耗散扩散和去噪扩散直接在三维网格表面生成纹理,有效解决了传统多视图图像方法存在的几何不一致性问题,并显著提升了纹理生成效率。
English: This paper introduces DoubleDiffusion, a novel method that directly generates textures on 3D meshes by combining heat dissipation diffusion with denoising diffusion, overcoming geometric inconsistencies and improving efficiency compared to traditional multi-view image approaches.

Authors:Pengwei Tang, Xiaolin Hu, Yong Liu
Title: ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
Abstract:
Prompt Tuning (PT) enables the adaptation of Pre-trained Large Language Models (PLMs) to downstream tasks by optimizing a small amount of soft virtual tokens, which are prepended to the input token embeddings. Recently, Decomposed Prompt Tuning (DePT) has demonstrated superior adaptation capabilities by decomposing the soft prompt into a shorter soft prompt and a pair of low-rank matrices. The product of the pair of low-rank matrices is added to the input token embeddings to offset them. Additionally, DePT achieves faster inference compared to PT due to the shorter soft prompt. However, in this paper, we find that the position-based token embedding offsets of DePT restrict its ability to generalize across diverse model inputs, and that the shared embedding offsets across many token embeddings result in sub-optimization. To tackle these issues, we introduce Adaptive Decomposed Prompt Tuning (ADePT), which is composed of a short soft prompt and a shallow token-shared feed-forward neural network. ADePT utilizes the token-shared feed-forward neural network to learn the embedding offsets for each token, enabling adaptive embedding offsets that vary according to the model input and better optimization of token embedding offsets. This enables ADePT to achieve superior adaptation performance without requiring more inference time or additional trainable parameters compared to vanilla PT and its variants. In comprehensive experiments across 23 natural language processing tasks and 4 typical PLMs of different scales, ADePT consistently surpasses the other leading parameter-efficient fine-tuning methods, and even outperforms the full fine-tuning in certain scenarios. We also provide a theoretical analysis towards ADePT. Code is available at https://github.com/HungerPWAY/ADePT.
中文摘要:自适应分解提示调优(ADePT)通过采用令牌共享前馈神经网络生成自适应嵌入偏移,在多种自然语言处理任务中实现了卓越性能,且相比现有方法无需增加推理时间或可训练参数。
English Summary: Adaptive Decomposed Prompt Tuning (ADePT) enhances prompt tuning by using a token-shared feed-forward network to generate adaptive embedding offsets, achieving superior performance across diverse NLP tasks without increasing inference time or parameters compared to existing methods.
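The difference between position-based offsets (DePT) and input-dependent offsets (ADePT) described in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the one-hidden-layer ReLU network, the absence of biases, and all names and toy weights are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def position_offsets(A, B):
    """DePT-style offsets: row i of the low-rank product A @ B is added to
    whatever token sits at position i, regardless of which token it is."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def token_offsets(embeddings, W1, W2):
    """ADePT-style offsets (assumed form): one small network, shared across
    tokens, maps each token embedding to its own offset."""
    return [matvec(W2, relu(matvec(W1, e))) for e in embeddings]

# Toy input: the same token embedding appears at positions 0 and 2.
embeddings = [[1.0, 2.0], [0.5, -1.0], [1.0, 2.0]]
A = [[1.0], [2.0], [3.0]]                   # seq_len x rank
B = [[0.1, 0.2]]                            # rank x dim
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hidden x dim
W2 = [[0.5, 0.0, 0.1], [0.0, 0.5, 0.1]]     # dim x hidden

pos = position_offsets(A, B)
tok = token_offsets(embeddings, W1, W2)
print(pos[0] == pos[2], tok[0] == tok[2])
```

In the toy input the repeated token receives two different position-based offsets but a single network-generated offset, which is the property the abstract attributes to ADePT.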

Authors:Liyang Qin, Xiaoli Wang, Chunhua Yang, Huaiwen Zou, Haochuan Zhang
Title: Sensorformer: Cross-patch attention with global-patch compression is effective for high-dimensional multivariate time series forecasting
Abstract:
Among the existing Transformer-based multivariate time series forecasting methods, iTransformer, which treats each variable sequence as a token and only explicitly extracts cross-variable dependencies, and PatchTST, which adopts a channel-independent strategy and only explicitly extracts cross-time dependencies, both significantly outperform most Channel-Dependent Transformers that simultaneously extract cross-time and cross-variable dependencies. This indicates that existing Transformer-based multivariate time series forecasting methods still struggle to effectively fuse these two types of information. We attribute this issue to the dynamic time lags in the causal relationships between different variables. Therefore, we propose a new multivariate time series forecasting Transformer, Sensorformer, which first compresses the global patch information and then simultaneously extracts cross-variable and cross-time dependencies from the compressed representations. Sensorformer can effectively capture the correct inter-variable correlations and causal relationships, even in the presence of dynamic causal lags between variables, while also reducing the computational complexity of pure cross-patch self-attention from $O(D^2 \cdot Patch\_num^2 \cdot d\_model)$ to $O(D^2 \cdot Patch\_num \cdot d\_model)$. Extensive comparative and ablation experiments on 9 mainstream real-world multivariate time series forecasting datasets demonstrate the superiority of Sensorformer. The implementation of Sensorformer, following the style of the Time-series-library and scripts for reproducing the main results, is publicly available at https://github.com/BigYellowTiger/Sensorformer
中文: Sensorformer是一种新型Transformer模型,通过压缩全局片段信息有效捕捉多元时间序列预测中的跨变量和跨时间依赖关系,解决了动态因果滞后问题并降低了计算复杂度。
English: Sensorformer is a novel Transformer model that effectively captures cross-variable and cross-time dependencies in multivariate time series forecasting by compressing global patch information, addressing dynamic causal lags while reducing computational complexity.

Authors:Haozhen Zhang, Haodong Yue, Xi Xiao, Le Yu, Qing Li, Zhen Ling, Ye Zhang
Title: Revolutionizing Encrypted Traffic Classification with MH-Net: A Multi-View Heterogeneous Graph Model
Abstract:
With the growing significance of network security, the classification of encrypted traffic has emerged as an urgent challenge. Traditional byte-based traffic analysis methods are constrained by the rigid granularity of information and fail to fully exploit the diverse correlations between bytes. To address these limitations, this paper introduces MH-Net, a novel approach for classifying network traffic that leverages multi-view heterogeneous traffic graphs to model the intricate relationships between traffic bytes. The essence of MH-Net lies in aggregating varying numbers of traffic bits into multiple types of traffic units, thereby constructing multi-view traffic graphs with diverse information granularities. By accounting for different types of byte correlations, such as header-payload relationships, MH-Net further endows the traffic graph with heterogeneity, significantly enhancing model performance. Notably, we employ contrastive learning in a multi-task manner to strengthen the robustness of the learned traffic unit representations. Experiments conducted on the ISCX and CIC-IoT datasets for both the packet-level and flow-level traffic classification tasks demonstrate that MH-Net achieves the best overall performance compared to dozens of SOTA methods.
中文: 本文提出MH-Net加密流量分类方法,通过构建多视图异构图捕捉字节间复杂关联并采用对比学习,在基准数据集上实现了最优性能。
English: This paper proposes MH-Net, a novel encrypted traffic classification method that constructs multi-view heterogeneous graphs to capture complex byte relationships and employs contrastive learning, achieving state-of-the-art performance on benchmark datasets.

Authors:Peihai Jiang, Xixiang Lyu, Yige Li, Jing Ma
Title: Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models
Abstract:
Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model's performance on primary tasks. Our code is available at https://github.com/XDJPH/BTU.
中文摘要:本文提出了一种名为后门令牌遗忘(BTU)的新型防御方法,通过在训练阶段主动检测并消除嵌入层中的异常令牌参数,并采用细粒度遗忘技术来有效防御后门攻击,同时保持模型主要任务的性能。
English Summary: This paper introduces Backdoor Token Unlearning (BTU), a novel defense method that proactively detects and neutralizes backdoor triggers during supervised fine-tuning by identifying aberrant token parameters in embedding layers and applying fine-grained unlearning techniques.

Authors:Qi Wang, Marco Federici, Herke van Hoof
Title: Bridge the Inference Gaps of Neural Processes via Expectation Maximization
Abstract:
The neural process (NP) is a family of computationally efficient models for learning distributions over functions. However, it suffers from under-fitting and shows suboptimal performance in practice. Researchers have primarily focused on incorporating diverse structural inductive biases, e.g., attention or convolution, in modeling. The topic of inference suboptimality and an analysis of the NP from the optimization objective perspective has hardly been studied in earlier work. To fix this issue, we propose a surrogate objective of the target log-likelihood of the meta dataset within the expectation maximization framework. The resulting model, referred to as the Self-normalized Importance weighted Neural Process (SI-NP), can learn a more accurate functional prior and has an improvement guarantee concerning the target log-likelihood. Experimental results show the competitive performance of SI-NP over other NP objectives and illustrate that structural inductive biases, such as attention modules, can also augment our method to achieve SOTA performance. Our code is available at https://github.com/hhq123gogogo/SI_NPs.
Chinese: 自归一化重要性加权神经过程(SI-NP)通过期望最大化框架中的替代目标优化,解决了神经过程拟合不足和性能欠佳的问题,提升了函数先验学习能力并取得了优越性能。
English: The Self-normalized Importance weighted Neural Process (SI-NP) is introduced to address the under-fitting and suboptimal performance of neural processes by optimizing a surrogate objective within the expectation maximization framework, enhancing functional prior learning and achieving competitive results.

Authors:Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen
Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Abstract:
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT or GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for the responses to each prompt, which may be biased and can lead to overfitting on simpler prompts and vulnerability to reward hacking. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using global advantage normalization, which is unbiased, to improve training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
中文: REINFORCE++是一种创新的强化学习方法,它通过移除评论家网络并采用全局优势归一化来提高训练稳定性,在使大语言模型与人类偏好对齐方面展现出鲁棒性能和卓越的泛化能力。
English: REINFORCE++ is a novel reinforcement learning method that eliminates the critic network and employs global advantage normalization to enhance training stability, demonstrating robust performance and superior generalization in aligning large language models with human preferences.
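The contrast the abstract above draws between per-prompt and global advantage normalization can be illustrated with a small sketch. The abstract gives no formula, so the standardization used here (subtract the mean, divide by the standard deviation), the function names, the epsilon, and the toy rewards are all assumptions; the OpenRLHF implementation may differ in its details.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Standardize a list of rewards to zero mean and (near) unit variance."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / (s + eps) for v in values]

def per_prompt_advantages(rewards_by_prompt):
    """Normalize rewards separately within each prompt's group of responses."""
    return {p: normalize(rs) for p, rs in rewards_by_prompt.items()}

def global_advantages(rewards_by_prompt):
    """Normalize all rewards in the batch together, then regroup by prompt."""
    flat = [r for rs in rewards_by_prompt.values() for r in rs]
    normed = iter(normalize(flat))
    return {p: [next(normed) for _ in rs] for p, rs in rewards_by_prompt.items()}

# Hypothetical batch: an easy prompt with high rewards, a hard one with low rewards.
rewards = {"easy": [0.9, 1.0], "hard": [0.0, 0.1]}
local = per_prompt_advantages(rewards)
glob = global_advantages(rewards)
print(local)
print(glob)
```

With per-prompt statistics, a tiny reward gap on the easy prompt is stretched to the same scale as the gap on the hard prompt. With batch-wide statistics, the advantages keep the information that every easy-prompt response scored above the batch mean.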

Authors:Thi Thuy Ngan Duong, Duy-Nam Bui, Manh Duong Phung
Title: Navigation Variable-based Multi-objective Particle Swarm Optimization for UAV Path Planning with Kinematic Constraints
Abstract:
Path planning is essential for unmanned aerial vehicles (UAVs) as it determines the path that the UAV needs to follow to complete a task. This work addresses this problem by introducing a new algorithm called navigation variable-based multi-objective particle swarm optimization (NMOPSO). It first models path planning as an optimization problem via the definition of a set of objective functions that include optimality and safety requirements for UAV operation. The NMOPSO is then used to minimize those functions through Pareto optimal solutions. The algorithm features a new path representation based on navigation variables to include kinematic constraints and exploit the maneuverable characteristics of the UAV. It also includes an adaptive mutation mechanism to enhance the diversity of the swarm for better solutions. Comparisons with various algorithms have been carried out to benchmark the proposed approach. The results indicate that the NMOPSO performs better than not only other particle swarm optimization variants but also other state-of-the-art multi-objective and metaheuristic optimization algorithms. Experiments have also been conducted with real UAVs to confirm the validity of the approach for practical flights. The source code of the algorithm is available at https://github.com/ngandng/NMOPSO.
Chinese: 本文提出NMOPSO算法,通过导航变量建模将无人机路径规划转化为多目标优化问题,在仿真和实际飞行中均优于现有先进算法。
English: This paper introduces NMOPSO, a novel algorithm that models UAV path planning as a multi-objective optimization problem and outperforms existing methods in both simulations and real-world flights.
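The abstract above minimizes several objectives "through Pareto optimal solutions". A minimal sketch of the underlying non-dominance test is shown below. It is the standard textbook definition for minimization, not code from the NMOPSO repository, and the candidate paths with their (length, risk) costs are hypothetical.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b in every objective
    and strictly better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(costs):
    """Keep only the cost vectors that no other vector dominates."""
    return [c for c in costs if not any(dominates(o, c) for o in costs)]

# Hypothetical candidate paths scored as (path length, collision risk).
paths = [(10.0, 0.8), (12.0, 0.3), (11.0, 0.9), (15.0, 0.2)]
front = pareto_front(paths)
print(front)
```

A multi-objective optimizer returns such a front rather than a single best path, leaving the trade-off between objectives to a later selection step.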

Authors:Guoxuan Chen, Lianghao Xia, Chao Huang
Title: LightGNN: Simple Graph Neural Network for Recommendation
Abstract:
Graph neural networks (GNNs) have demonstrated superior performance in collaborative recommendation through their ability to conduct high-order representation smoothing, effectively capturing structural information within users' interaction patterns. However, existing GNN paradigms face significant challenges in scalability and robustness when handling large-scale, noisy, and real-world datasets. To address these challenges, we present LightGNN, a lightweight and distillation-based GNN pruning framework designed to substantially reduce model complexity while preserving essential collaboration modeling capabilities. Our LightGNN framework introduces a computationally efficient pruning module that adaptively identifies and removes redundant edges and embedding entries for model compression. The framework is guided by a resource-friendly hierarchical knowledge distillation objective, whose intermediate layer augments the observed graph to maintain performance, particularly in high-rate compression scenarios. Extensive experiments on public datasets demonstrate LightGNN's effectiveness, significantly improving both computational efficiency and recommendation accuracy. Notably, LightGNN achieves an 80% reduction in edge count and 90% reduction in embedding entries while maintaining performance comparable to more complex state-of-the-art baselines. The implementation of our LightGNN framework is available at the github repository: https://github.com/HKUDS/LightGNN.
中文摘要:LightGNN是一种轻量级图神经网络框架,通过知识蒸馏自适应剪枝冗余边和嵌入,在保持性能的同时显著提升推荐系统的计算效率与模型压缩率。
English Summary: LightGNN is a lightweight graph neural network framework that enhances collaborative filtering by adaptively pruning redundant edges and embeddings through knowledge distillation, achieving significant model compression with maintained performance.

Authors:Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Abstract:
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address this issue, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few-shot learning's 1.2%. Moreover, it can achieve an additional 7.5% gain combined with tree search. Surprisingly, it enhances state-of-the-art LLMs to solve challenging math problems using simpler examples. It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
中文摘要:BoostStep通过步骤对齐的上下文学习机制和有效示范策略,显著提升大语言模型在数学推理中的准确性,在多项测试中取得突破性性能提升。
English Summary: BoostStep enhances large language models' mathematical reasoning by aligning in-context learning examples with specific reasoning steps and using relevant exemplars, significantly improving accuracy across benchmarks.

Authors:Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
Abstract:
Active real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing; 2) Decision: raising proactive interaction in proper situations; 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at https://github.com/Mark12Ding/Dispider.
中文: Dispider通过解耦感知、决策和反应能力,并采用异步设计,实现了视频大语言模型的主动实时交互,确保在流媒体视频分析中提供高效、及时的响应。
English: Dispider introduces a disentangled and asynchronous system that enables active real-time interaction with video LLMs by separating perception, decision, and reaction capabilities, ensuring efficient and timely responses for streaming video analysis.

Authors:Libing Yuan, Shuaibo Hu, Kui Yu, Le Wu
Title: Boosting Explainability through Selective Rationalization in Pre-trained Language Models
Abstract:
The widespread application of pre-trained language models (PLMs) in natural language processing (NLP) has led to increasing concerns about their explainability. Selective rationalization is a self-explanatory framework that selects human-intelligible input subsets as rationales for predictions. Recent studies have shown that applying existing rationalization frameworks to PLMs will result in severe degeneration and failure problems, producing sub-optimal or meaningless rationales. Such failures severely damage trust in rationalization methods and constrain the application of rationalization techniques on PLMs. In this paper, we find that the homogeneity of tokens in the sentences produced by PLMs is the primary contributor to these problems. To address these challenges, we propose a method named Pre-trained Language Model's Rationalization (PLMR), which splits PLMs into a generator and a predictor to deal with NLP tasks while providing interpretable rationales. The generator in PLMR also alleviates homogeneity by pruning irrelevant tokens, while the predictor uses full-text information to standardize predictions. Experiments conducted on two widely used datasets across multiple PLMs demonstrate the effectiveness of the proposed method PLMR in addressing the challenge of applying selective rationalization to PLMs. Codes: https://github.com/ylb777/PLMR.
Chinese: 预训练语言模型因标记同质性常产生不可靠解释,而PLMR方法通过分离生成与预测有效解决了这一问题,提供了可解释的合理化方案。
English: Pre-trained language models often produce unreliable rationales due to token homogeneity, but the proposed PLMR method effectively addresses this by separating generation and prediction to provide interpretable explanations.

Authors:Ali Al-Lawati, Jason Lucas, Prasenjit Mitra
Title: Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in various NLP tasks, including semantic parsing, which translates natural language into formal code representations. However, the reverse process, translating code into natural language, termed semantic captioning, has received less attention. This task is becoming increasingly important as LLMs are integrated into platforms for code generation, security analysis, and educational purposes. In this paper, we focus on the captioning of SQL queries (SQL2Text) to address the critical need for understanding and explaining SQL queries in an era where LLM-generated code poses potential security risks. We repurpose Text2SQL datasets for SQL2Text by introducing an iterative ICL prompt using GPT-4o to generate multiple additional utterances, which enhances the robustness of the datasets for the reverse task. We conduct our experiments using in-context learning (ICL) based on different sample selection methods, emphasizing smaller, more computationally efficient LLMs. Our findings demonstrate that leveraging the inherent graph properties of SQL for ICL sample selection significantly outperforms random selection by up to 39% on BLEU score and provides better results than alternative methods. The dataset and code are published at https://github.com/aliwister/ast-icl.
中文: 本研究针对将SQL查询转化为自然语言的语义描述任务,通过GPT-4o增强数据集并证明基于图结构的示例选择方法能显著提升小规模语言模型的性能表现,效果优于随机选择达39%。
English: This study addresses the semantic captioning task of translating SQL queries into natural language (SQL2Text) by repurposing datasets with GPT-4o-generated utterances and demonstrating that graph-based sample selection for in-context learning significantly outperforms random methods, especially in smaller LLMs.

Authors:Valery Istomin, Oleg Pereziabov, Ilya Afanasyev
Title: Geometry Restoration and Dewarping of Camera-Captured Document Images
Abstract:
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process more efficient and faster, as it requires significantly fewer computational resources and memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: https://github.com/HorizonParadox/DRCCBI
中文: 本研究提出了一种结合深度学习轮廓检测与计算机视觉技术的文档复原方法,通过非线性畸变校正显著提升了OCR可读性和几何复原指标,性能优于当前主流方案。
English: This study introduces an efficient document restoration method combining deep learning for outline detection with computer vision techniques to correct distortions, demonstrating superior performance in OCR readability and geometry metrics compared to existing solutions.

Authors:Yuxiang Bao, Guoliang Kang, Linlin Yang, Xiaoyue Duan, Bo Zhao, Baochang Zhang
Title: Normalizing Batch Normalization for Long-Tailed Recognition
Abstract:
In real-world scenarios, the number of training samples across classes usually follows a long-tailed distribution. A conventionally trained network may achieve unexpectedly inferior performance on the rare classes compared to the frequent classes. Most previous works attempt to rectify the network bias at the data level or the classifier level. In contrast, in this paper we identify that the bias towards the frequent classes may be encoded into features, i.e., the rare-specific features, which play a key role in discriminating the rare classes, are much weaker than the frequent-specific features. Based on this observation, we introduce a simple yet effective approach: normalizing the parameters of the Batch Normalization (BN) layer to explicitly rectify the feature bias. To this end, we represent the Weight/Bias parameters of a BN layer as a vector, normalize it to unit length, and multiply the unit vector by a learnable scalar parameter. By decoupling the direction and magnitude of the BN layer parameters during learning, the Weight/Bias exhibits a more balanced distribution and thus the strength of features becomes more even. Extensive experiments on various long-tailed recognition benchmarks (i.e., CIFAR-10/100-LT, ImageNet-LT and iNaturalist 2018) show that our method remarkably outperforms previous state-of-the-art methods. The code and checkpoints are available at https://github.com/yuxiangbao/NBN.
中文: 本文针对长尾数据集中的类别不平衡问题,提出通过归一化批归一化层参数来纠正特征偏差的方法,在多个基准测试中显著提升了识别性能。
English: This paper addresses class imbalance in long-tailed datasets by proposing a method to normalize Batch Normalization parameters, which rectifies feature bias and improves recognition performance across various benchmarks.
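
The reparameterization described in the abstract above (normalize the BN Weight/Bias vector to unit length, then rescale by a learnable scalar) can be sketched in a few lines. This is only an illustration of the arithmetic, not the authors' implementation: the function name `normalize_bn_param`, the `eps` stabilizer, and the example values are assumptions introduced here.

```python
import math

def normalize_bn_param(values, scale, eps=1e-12):
    """Return scale * v / ||v|| for a BN parameter vector v.

    `values` stands in for the per-channel Weight (or Bias) of a BN layer,
    and `scale` for the learnable scalar magnitude. Both names, and `eps`,
    are hypothetical and used only for this sketch.
    """
    norm = math.sqrt(sum(v * v for v in values))
    # Direction comes from the unit vector, magnitude from the scalar.
    return [scale * v / (norm + eps) for v in values]

# Illustrative values: a two-channel weight vector with L2 norm 5.
raw_weight = [3.0, 4.0]
effective_weight = normalize_bn_param(raw_weight, scale=2.0)
effective_norm = math.sqrt(sum(w * w for w in effective_weight))
```

Whatever the raw vector is, the effective parameter always has L2 norm equal to the scalar, so direction and magnitude are learned separately.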

Authors:Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Viren Bajaj, Zeya Ahmad
Title: LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases
Abstract:
Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework.
Chinese: LangFair 是一个开源 Python 工具包,旨在帮助大语言模型从业者通过生成评估数据集和计算相关指标来评估偏见与公平性风险,并提供可操作的指标选择框架。
English: LangFair is an open-source Python package designed to help LLM practitioners assess bias and fairness risks by enabling easy generation of evaluation datasets and calculation of relevant metrics, supported by a decision framework for metric selection.

Authors:Yibin Wu, Jian Kuang, Xiaoji Niu, Cyrill Stachniss, Lasse Klingbeil, Heiner Kuhlmann
Title: Wheel-GINS: A GNSS/INS Integrated Navigation System with a Wheel-mounted IMU
Abstract:
A long-term accurate and robust localization system is essential for mobile robots to operate efficiently outdoors. Recent studies have shown the significant advantages of the wheel-mounted inertial measurement unit (Wheel-IMU)-based dead reckoning system. However, it still drifts over extended periods because of the absence of external correction signals. To achieve the goal of long-term accurate localization, we propose Wheel-GINS, a Global Navigation Satellite System (GNSS)/inertial navigation system (INS) integrated navigation system using a Wheel-IMU. Wheel-GINS fuses the GNSS position measurement with the Wheel-IMU via an extended Kalman filter to limit the long-term error drift and provide continuous state estimation when the GNSS signal is blocked. Considering the specificities of the GNSS/Wheel-IMU integration, we conduct detailed modeling and online estimation of the Wheel-IMU installation parameters, including the Wheel-IMU leverarm and mounting angle and the wheel radius error. Experimental results have shown that Wheel-GINS outperforms the traditional GNSS/Odometer/INS integrated navigation system during GNSS outages. At the same time, Wheel-GINS can effectively estimate the Wheel-IMU installation parameters online and, consequently, improve the localization accuracy and practicality of the system. The source code of our implementation is publicly available (https://github.com/i2Nav-WHU/Wheel-GINS).
Chinese: Wheel-GINS是一种基于轮式IMU的GNSS/INS组合导航系统,通过扩展卡尔曼滤波器融合GNSS定位数据来抑制长期误差漂移,并能在线估计安装参数,在GNSS信号中断时显著提升定位精度和实用性。
English: Wheel-GINS is a novel GNSS/INS integrated system using a wheel-mounted IMU that reduces long-term drift by fusing GNSS data via an extended Kalman filter and enables online estimation of installation parameters to enhance localization accuracy during GNSS outages.

Authors:Haojin Li, Heng Li, Jianyu Chen, Rihan Zhong, Ke Niu, Huazhu Fu, Jiang Liu
Title: AIF-SFDA: Autonomous Information Filter-driven Source-Free Domain Adaptation for Medical Image Segmentation
Abstract:
Decoupling domain-variant information (DVI) from domain-invariant information (DII) serves as a prominent strategy for mitigating domain shifts in the practical implementation of deep learning algorithms. However, in medical settings, concerns surrounding data collection and privacy often restrict access to both training and test data, hindering the empirical decoupling of information by existing methods. To tackle this issue, we propose an Autonomous Information Filter-driven Source-free Domain Adaptation (AIF-SFDA) algorithm, which leverages a frequency-based learnable information filter to autonomously decouple DVI and DII. Information Bottleneck (IB) and Self-supervision (SS) are incorporated to optimize the learnable frequency filter. The IB governs the information flow within the filter to diminish redundant DVI, while SS preserves DII in alignment with the specific task and image modality. Thus, the autonomous information filter can overcome domain shifts relying solely on target data. A series of experiments covering various medical image modalities and segmentation tasks were conducted to demonstrate the benefits of AIF-SFDA through comparisons with leading algorithms and ablation studies. The code is available at https://github.com/JingHuaMan/AIF-SFDA.
中文: 提出的AIF-SFDA算法通过基于频率的可学习信息过滤器,结合信息瓶颈与自监督优化,实现了无需源数据的域自适应,在多种医学影像分割任务中有效克服了域偏移问题。
English: The proposed AIF-SFDA algorithm autonomously decouples domain-variant and domain-invariant information using a frequency-based filter optimized by Information Bottleneck and Self-supervision, effectively overcoming domain shifts in medical image segmentation without requiring source data access.

Authors:Duygu Sezen Islakoglu, Jan-Christoph Kalo
Title: ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events
Abstract:
Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has attracted increasing research attention. However, comprehensive testing of Allen's interval relations (e.g., before, after, during) -- a fundamental framework for temporal relationships -- remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs' temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models' low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.
中文: 大型语言模型在时间推理方面存在不足,因此开发了ChronoSense基准测试,揭示其表现不稳定且依赖记忆,凸显了提升时间理解能力的必要性。
English: Large Language Models struggle with temporal reasoning, prompting the creation of ChronoSense, a benchmark that reveals their inconsistent performance and reliance on memorization, underscoring the need for enhanced temporal understanding.
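
For readers unfamiliar with Allen's interval relations mentioned in the abstract above, the following sketch classifies the relation between two closed time intervals. It is a generic illustration of the thirteen standard relations and is not taken from the ChronoSense code; the function name `allen_relation` and the year-based example intervals are assumptions made here.

```python
def allen_relation(a, b):
    """Name the Allen relation of interval a relative to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a_s, a_e), (b_s, b_e) = a, b
    if a_s >= a_e or b_s >= b_e:
        raise ValueError("each interval needs start < end")
    if a_e < b_s:
        return "before"
    if b_e < a_s:
        return "after"
    if a_e == b_s:
        return "meets"
    if b_e == a_s:
        return "met-by"
    if a_s == b_s and a_e == b_e:
        return "equals"
    if a_s == b_s:
        return "starts" if a_e < b_e else "started-by"
    if a_e == b_e:
        return "finishes" if a_s > b_s else "finished-by"
    if b_s < a_s and a_e < b_e:
        return "during"
    if a_s < b_s and b_e < a_e:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if a_s < b_s else "overlapped-by"

# Hypothetical events given as (start_year, end_year).
rel_before = allen_relation((1990, 1995), (2000, 2005))
rel_overlaps = allen_relation((2000, 2010), (2005, 2015))
rel_during = allen_relation((2003, 2004), (2000, 2010))
rel_meets = allen_relation((2000, 2005), (2005, 2010))
```

Swapping the two arguments yields the inverse relation (for example "before" becomes "after"), which is the kind of symmetry the abstract refers to.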

Authors:Xiang Zheng, Longxiang Wang, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang
Title: CALM: Curiosity-Driven Auditing for Large Language Models
Abstract:
Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output, or an input that induces a hallucinated response from the target LLM involving politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
中文摘要:本研究提出CALM方法,通过基于好奇心的强化学习训练审计代理,在无法获取内部参数的黑盒大语言模型中自动识别有害或有偏见的输入输出组合。
English Summary: This study introduces CALM, a curiosity-driven auditing method using reinforcement learning to automatically detect harmful or biased input-output pairs in black-box large language models without accessing their internal parameters.

Authors:Xianhao Zhou, Jianghao Wu, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang
Title: GLFC: Unified Global-Local Feature and Contrast Learning with Mamba-Enhanced UNet for Synthetic CT Generation from CBCT
Abstract:
Generating synthetic Computed Tomography (CT) images from Cone Beam Computed Tomography (CBCT) is desirable for improving the image quality of CBCT. Existing synthetic CT (sCT) generation methods using Convolutional Neural Networks (CNN) and Transformers often face difficulties in effectively capturing both global and local features and contrasts for high-quality sCT generation. In this work, we propose a Global-Local Feature and Contrast learning (GLFC) framework for sCT generation. First, a Mamba-Enhanced UNet (MEUNet) is introduced by integrating Mamba blocks into the skip connections of a high-resolution UNet for effective global and local feature learning. Second, we propose a Multiple Contrast Loss (MCL) that calculates synthetic loss at different intensity windows to improve quality for both soft tissues and bone regions. Experiments on the SynthRAD2023 dataset demonstrate that GLFC improved the SSIM of sCT from 77.91% to 91.50% compared with the original CBCT, and significantly outperformed several existing methods for sCT generation. The code is available at https://github.com/HiLab-git/GLFC
中文摘要:本研究提出了一种全局-局部特征与对比学习框架,通过结合Mamba增强的UNet架构和多窗口对比损失,显著提升了从锥形束CT生成合成CT的图像质量,在SynthRAD2023数据集上实现了91.50%的结构相似性指数。
English Summary: This study introduces a Global-Local Feature and Contrast learning (GLFC) framework that combines Mamba-enhanced UNet architecture with multi-window contrast loss to significantly improve synthetic CT image quality from CBCT data, achieving 91.50% SSIM on SynthRAD2023 benchmark.

Authors:Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
Title: Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Abstract:
Multilingual neural machine translation (MNMT) aims to support arbitrary translation directions across multiple languages. Although MNMT-specific models trained on parallel data offer low costs in training and deployment, their performance consistently lags behind that of large language models (LLMs). In this work, we introduce registering, a novel method that enables a small MNMT-specific model to compete with LLMs. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, the target token generation only pays attention to the activation of registers, representing the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method advances the state-of-the-art of MNMT. We further pre-train two models, named MITRE (multilingual translation with registers), on 9.3 billion sentence pairs across 24 languages collected from public corpora. One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: https://github.com/zhiqu22/mitre.
Chinese Summary: 本研究提出“注册”方法,通过插入人工标记引导目标语言生成,使小型多语言神经机器翻译模型能够与大型语言模型竞争,并在EC-40基准测试中取得领先性能。
English Summary: This study introduces "registering," a method that enhances small multilingual neural machine translation (MNMT) models by inserting artificial tokens to guide target language generation, enabling them to compete with large language models (LLMs) and achieving state-of-the-art results on the EC-40 benchmark.
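
The attention-mask modification described in the abstract above can be illustrated with a small sketch. It assumes a decoder-only layout of [source tokens | registers | target tokens] with otherwise causal attention, where target positions are additionally blocked from attending to source positions. The abstract does not spell out these details, so the causal assumption, the function name `register_attention_mask`, and the toy sizes are all hypothetical, and the MITRE models may build their mask differently.

```python
def register_attention_mask(n_src, n_reg, n_tgt):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: source tokens, then registers, then target tokens.
    """
    total = n_src + n_reg + n_tgt
    tgt_start = n_src + n_reg
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            allowed = k <= q  # assumption: causal attention by default
            if q >= tgt_start and k < n_src:
                # Target tokens see the source only through the registers.
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask

# Toy example: 2 source tokens, 1 register, 2 target tokens.
mask = register_attention_mask(n_src=2, n_reg=1, n_tgt=2)
```

In this toy case the register row attends to both source tokens and itself, while each target row attends only to the register and to the target tokens generated so far.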

Authors:Chuanbo Hua, Federico Berto, Jiwoo Son, Seunghyun Kang, Changhyun Kwon, Jinkyoo Park
Title: CAMP: Collaborative Attention Model with Profiles for Vehicle Routing Problems
Abstract:
The profiled vehicle routing problem (PVRP) is a generalization of the heterogeneous capacitated vehicle routing problem (HCVRP) in which the objective is to optimize the routes of vehicles to serve client demands subject to different vehicle profiles, with each having a preference or constraint on a per-client basis. While existing learning methods have shown promise for solving the HCVRP in real-time, no learning method exists to solve the more practical and challenging PVRP. In this paper, we propose a Collaborative Attention Model with Profiles (CAMP), a novel approach that learns efficient solvers for PVRP using multi-agent reinforcement learning. CAMP employs a specialized attention-based encoder architecture to embed profiled client embeddings in parallel for each vehicle profile. We design a communication layer between agents for collaborative decision-making across profiled embeddings at each decoding step and a batched pointer mechanism to attend to the profiled embeddings to evaluate the likelihood of the next actions. We evaluate CAMP on two variants of PVRPs: PVRP with preferences, which explicitly influence the reward function, and PVRP with zone constraints with different numbers of agents and clients, demonstrating that our learned solvers achieve competitive results compared to both classical state-of-the-art neural multi-agent models in terms of solution quality and computational efficiency. We make our code openly available at https://github.com/ai4co/camp.
中文: 本文提出CAMP模型,通过多智能体强化学习和协作注意力机制解决带配置的车辆路径问题,在偏好和区域约束两种变体上均实现了与先进方法相媲美的求解质量和计算效率。
English: This paper introduces CAMP, a multi-agent reinforcement learning model that uses collaborative attention to solve the profiled vehicle routing problem, achieving competitive performance in solution quality and efficiency for both preference-based and zone-constrained variants.

Authors:Can Gao, Xiaofeng Tan, Jie Zhou, Weiping Ding, Witold Pedrycz
Title: Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls
Abstract:
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data and has been extensively studied and used in a variety of practical tasks. However, most unsupervised outlier detection methods are carefully designed to detect specified outliers, while real-world data may be entangled with different types of outliers. In this study, we propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers. Specifically, a novel fuzzy rough sets-based method that integrates relative fuzzy granule density is first introduced to improve the capability of detecting local outliers. Then, a multi-scale view generation method based on granular-ball computing is proposed to collaboratively identify group outliers at different levels of granularity. Moreover, reliable outliers and inliers determined by the three-way decision are used to train a weighted support vector machine to further improve the performance of outlier detection. The proposed method innovatively transforms unsupervised outlier detection into a semi-supervised classification problem and for the first time explores the fuzzy rough sets-based outlier detection from the perspective of multi-scale granular balls, allowing for high adaptability to different types of outliers. Extensive experiments carried out on both artificial and UCI datasets demonstrate that the proposed outlier detection method significantly outperforms the state-of-the-art methods, improving the results by at least 8.48% in terms of the Area Under the ROC Curve (AUROC) index. The source code is released at https://github.com/Xiaofeng-Tan/MGBOD.
中文: 本研究提出了一种基于模糊粗糙集的多尺度异常检测方法,将无监督检测转化为半监督分类问题,在AUROC指标上显著优于现有方法至少8.48%。
English: This study introduces a fuzzy rough sets-based multi-scale outlier detection method that transforms unsupervised detection into semi-supervised classification, significantly outperforming existing techniques by at least 8.48% in AUROC metrics.

Authors:Stephan Goerttler, Yucheng Wang, Emadeldeen Eldele, Min Wu, Fei He
Title: MSA-CNN: A Lightweight Multi-Scale CNN with Attention for Sleep Stage Classification
Abstract:
Recent advancements in machine learning-based signal analysis, coupled with open data initiatives, have fuelled efforts in automatic sleep stage classification. Despite the proliferation of classification models, few have prioritised reducing model complexity, which is a crucial factor for practical applications. In this work, we introduce Multi-Scale and Attention Convolutional Neural Network (MSA-CNN), a lightweight architecture featuring as few as ~10,000 parameters. MSA-CNN leverages a novel multi-scale module employing complementary pooling to eliminate redundant filter parameters and dense convolutions. Model complexity is further reduced by separating temporal and spatial feature extraction and using cost-effective global spatial convolutions. This separation of tasks not only reduces model complexity but also mirrors the approach used by human experts in sleep stage scoring. We evaluated both small and large configurations of MSA-CNN against nine state-of-the-art baseline models across three public datasets, treating univariate and multivariate models separately. Our evaluation, based on repeated cross-validation and re-evaluation of all baseline models, demonstrated that the large MSA-CNN outperformed all baseline models on all three datasets in terms of accuracy and Cohen's kappa, despite its significantly reduced parameter count. Lastly, we explored various model variants and conducted an in-depth analysis of the key modules and techniques, providing deeper insights into the underlying mechanisms. The code for our models, baselines, and evaluation procedures is available at https://github.com/sgoerttler/MSA-CNN.
中文摘要:本研究提出的MSA-CNN轻量级神经网络仅含约1万个参数,在三个数据集上的睡眠分期分类任务中,以显著降低的模型复杂度超越了九个先进基线模型。
English Summary: This study introduces MSA-CNN, a lightweight neural network with only about 10,000 parameters that outperforms nine state-of-the-art models in sleep stage classification across three datasets despite its reduced complexity.

Authors:Shi Bin Hoo, Samuel Müller, David Salinas, Frank Hutter
Title: From Tables to Time: How TabPFN-v2 Outperforms Specialized Time Series Forecasting Models
Abstract:
Foundation models have become increasingly popular for forecasting due to their ability to provide predictions without requiring a lot of training data. In this work, we demonstrate how TabPFN-v2, a general tabular foundation model, can be effectively applied to time series forecasting. We introduce TabPFN-TS, a simple method that combines TabPFN-v2 with lightweight feature engineering to enable both point and probabilistic forecasting. Despite its simplicity and compact size (11M parameters), TabPFN-TS achieves top rank on the public GIFT-Eval leaderboard in both forecasting tasks. Through ablation studies, we investigate factors contributing to this surprising effectiveness, especially considering TabPFN-v2 was pretrained solely on synthetic tabular data with no exposure to time series. Our results highlight the potential of tabular foundation models like TabPFN-v2 as a valuable new approach for time series forecasting. Our implementation is available at https://github.com/PriorLabs/tabpfn-time-series.
中文: 基础模型如TabPFN-v2为时间序列预测提供了新方法,尽管仅基于合成表格数据预训练,但通过轻量级特征工程即在基准测试中取得领先性能。
English: Foundation models like TabPFN-v2 offer a novel approach to time series forecasting, achieving top performance on benchmarks through minimal feature engineering despite being pretrained only on synthetic tabular data.

Authors:Jiexi Zhong, Zhiheng Li, Yubo Cui, Zheng Fang
Title: 4D-CS: Exploiting Cluster Prior for 4D Spatio-Temporal LiDAR Semantic Segmentation
Abstract:
Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore spatio-temporal information of multi-scan to identify the semantic classes and motion states for each point. However, these methods often overlook the segmentation consistency in space and time, which may result in point clouds within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that can reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current feature through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore missing features due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on the multi-scan semantic and moving object segmentation on SemanticKITTI and nuScenes datasets. The code will be available at https://github.com/NEU-REAL/4D-CS.git.
中文: 提出的4D-CS方法通过结合基于点和基于聚类的双分支网络,利用多帧聚类标签提升激光雷达语义分割的时空一致性,在自动驾驶数据集上实现了最优性能。
English: The proposed 4D-CS method enhances LiDAR semantic segmentation consistency by integrating point-based and cluster-based branches with multi-frame cluster labels, achieving state-of-the-art performance on autonomous driving datasets.

Authors:Sahar Salimpour, Jorge Peña-Queralta, Diego Paez-Granados, Jukka Heikkonen, Tomi Westerlund
Title: Sim-to-Real Transfer for Mobile Robots with Reinforcement Learning: from NVIDIA Isaac Sim to Gazebo and Real ROS 2 Robots
Abstract:
Unprecedented agility and dexterous manipulation have been demonstrated with controllers based on deep reinforcement learning (RL), with a significant impact on legged and humanoid robots. Modern tooling and simulation platforms, such as NVIDIA Isaac Sim, have been enabling such advances. This article focuses on demonstrating the applications of Isaac in local planning and obstacle avoidance as one of the most fundamental ways in which a mobile robot interacts with its environments. Although there is extensive research on proprioception-based RL policies, the article highlights less standardized and reproducible approaches to exteroception. At the same time, the article aims to provide a base framework for end-to-end local navigation policies and to show how a custom robot can be trained in such a simulation environment. We benchmark end-to-end policies against the state-of-the-art Nav2 navigation stack in the Robot Operating System (ROS). We also cover the sim-to-real transfer process by demonstrating zero-shot transferability of policies trained in the Isaac simulator to real-world robots. This is further evidenced by the tests with different simulated robots, which show the generalization of the learned policy. Finally, the benchmarks demonstrate comparable performance to Nav2, opening the door to quick deployment of state-of-the-art end-to-end local planners for custom robot platforms, but importantly furthering the possibilities by expanding the state and action spaces or task definitions for more complex missions. Overall, with this article we introduce the most important steps, and aspects to consider, in deploying RL policies for local path planning and obstacle avoidance with Isaac Sim training, Gazebo testing, and ROS 2 for real-time inference in real robots. The code is available at https://github.com/sahars93/RL-Navigation.
中文: 本文展示了基于NVIDIA Isaac Sim的深度强化学习如何实现移动机器人的局部路径规划与避障,其性能媲美Nav2导航系统,并成功完成了从仿真到现实环境的策略迁移。
English: This article demonstrates how deep reinforcement learning in NVIDIA Isaac Sim enables effective local planning and obstacle avoidance for mobile robots, achieving comparable performance to Nav2 and showcasing successful sim-to-real transfer.

Authors:Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Fadi Boutros, Raghavendra Ramachandra, Naser Damer
Title: FoundPAD: Foundation Models Reloaded for Face Presentation Attack Detection
Abstract:
Although face recognition systems have seen a massive performance enhancement in recent years, they are still targeted by threats such as presentation attacks, leading to the need for generalizable presentation attack detection (PAD) algorithms. Current PAD solutions suffer from two main problems: low generalization to unknown scenarios and large training data requirements. Foundation models (FM) are pre-trained on extensive datasets, achieving remarkable results when generalizing to unseen domains and allowing for efficient task-specific adaptation even when little training data are available. In this work, we recognize the potential of FMs to address common PAD problems and tackle the PAD task with an adapted FM for the first time. The FM under consideration is adapted with LoRA weights while simultaneously training a classification header. The resultant architecture, FoundPAD, is highly generalizable to unseen domains, achieving competitive results in several settings under different data availability scenarios and even when using synthetic training data. To encourage reproducibility and facilitate further research in PAD, we publicly release the implementation of FoundPAD at https://github.com/gurayozgur/FoundPAD.
中文: 本研究提出FoundPAD,这是一种高度泛化的呈现攻击检测系统,通过适配基础模型,能在数据有限的情况下有效应对未知场景的攻击威胁。
English: This work introduces FoundPAD, a highly generalizable presentation attack detection system that adapts foundation models to effectively counter unseen domain threats with minimal data requirements.
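The FoundPAD abstract says the foundation model is adapted with LoRA weights. As a rough illustration of what a LoRA update does, here is a minimal pure-Python sketch of one adapted linear map, assuming the usual formulation y = xW + (alpha / r)(xA)B with W frozen. The helper names (`matmul`, `lora_forward`) and the toy shapes are hypothetical and are not taken from the FoundPAD repository.

```python
def matmul(a, b):
    """Plain nested-list matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_forward(x, w, a, b, alpha):
    """Frozen base projection plus a scaled low-rank correction.

    x: batch x d_in, w: d_in x d_out (frozen),
    a: d_in x r and b: r x d_out (the only trainable matrices).
    """
    rank = len(b)
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    scale = alpha / rank
    return [[base[i][j] + scale * delta[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]          # frozen identity weight for the demo
a = [[1.0], [1.0]]                    # rank r = 1
b_zero = [[0.0, 0.0]]                 # common LoRA init: B starts at zero
b_trained = [[0.5, -0.5]]             # made-up values standing in for trained weights

y_init = lora_forward(x, w, a, b_zero, alpha=2.0)
y_adapted = lora_forward(x, w, a, b_trained, alpha=2.0)
```

With B initialised to zero the adapted layer reproduces the frozen model exactly, which is why training can start from the pre-trained behaviour.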

Authors:Asma Alkalbani, Muhammad Saqib, Ahmed Salim Alrawahi, Abbas Anwar, Chandarnath Adak, Saeed Anwar
Title: RDD4D: 4D Attention-Guided Road Damage Detection And Classification
Abstract:
Road damage detection and assessment are crucial components of infrastructure maintenance. However, current methods often struggle with detecting multiple types of road damage in a single image, particularly at varying scales. This is due to the lack of road datasets with various damage types having varying scales. To overcome this deficiency, first, we present a novel dataset called Diverse Road Damage Dataset (DRDD) for road damage detection that captures the diverse road damage types in individual images, addressing a crucial gap in existing datasets. Then, we provide our model, RDD4D, that exploits Attention4D blocks, enabling better feature refinement across multiple scales. The Attention4D module processes feature maps through an attention mechanism combining positional encoding and "Talking Head" components to capture local and global contextual information. In our comprehensive experimental analysis comparing various state-of-the-art models on our proposed dataset, our enhanced model demonstrated superior performance in detecting large-sized road cracks with an Average Precision (AP) of 0.458 and maintained competitive performance with an overall AP of 0.445. Moreover, we also provide results on the CrackTinyNet dataset; our model achieved around a 0.21 increase in performance. The code, model weights, dataset, and our results are available at https://github.com/msaqib17/Road_Damage_Detection.
Chinese: 本研究提出了新型多样化道路损伤数据集(DRDD)和采用Attention4D模块的RDD4D模型,显著提升了多尺度道路损伤检测能力,在大尺寸裂缝识别和整体损伤检测方面均表现出优越性能。
English: This study introduces a novel Diverse Road Damage Dataset (DRDD) and the RDD4D model with Attention4D blocks, which significantly improves multi-scale road damage detection and achieves superior performance in identifying large cracks and overall damage.

Authors:Chunxin Zheng, Yulin Li, Zhiyuan Song, Zhihai Bi, Jinni Zhou, Boyu Zhou, Jun Ma
Title: Local Reactive Control for Mobile Manipulators with Whole-Body Safety in Complex Environments
Abstract:
Mobile manipulators typically encounter significant challenges in navigating narrow, cluttered environments due to their high-dimensional state spaces and complex kinematics. While reactive methods excel in dynamic settings, they struggle to efficiently incorporate complex, coupled constraints across the entire state space. In this work, we present a novel local reactive controller that reformulates the time-domain single-step problem into a multi-step optimization problem in the spatial domain, leveraging the propagation of a serial kinematic chain. This transformation facilitates the formulation of customized, decoupled link-specific constraints, and the resulting problem is solved efficiently with augmented Lagrangian differential dynamic programming (AL-DDP). Our approach naturally absorbs spatial kinematic propagation in the forward pass and processes all link-specific constraints simultaneously during the backward pass, enhancing both constraint management and computational efficiency. Notably, in this framework, we formulate collision avoidance constraints for each link using accurate geometric models with extracted free regions, and this improves the maneuverability of the mobile manipulator in narrow, cluttered spaces. Experimental results showcase significant improvements in safety, efficiency, and task completion rates. These findings underscore the robustness of the proposed method, particularly in narrow, cluttered environments where conventional approaches could falter. The open-source project can be found at https://github.com/Chunx1nZHENG/MM-with-Whole-Body-Safety-Release.git.
Chinese: 本文提出了一种新型反应式控制器,将单步优化问题转化为空间多步优化,利用增强拉格朗日微分动态规划高效处理解耦的连杆特定约束,显著提升了移动机械臂在狭窄杂乱环境中的导航能力。
English: This paper introduces a novel reactive controller that transforms single-step optimization into a spatial multi-step problem, enabling efficient handling of decoupled link-specific constraints through augmented Lagrangian differential dynamic programming to enhance mobile manipulator navigation in cluttered environments.

Authors:Niloufar Eghbali, Hassan Bagher-Ebadian, Tuka Alhanai, Mohammad M. Ghassemi
Title: GLoG-CSUnet: Enhancing Vision Transformers with Adaptable Radiomic Features for Medical Image Segmentation
Abstract:
Vision Transformers (ViTs) have shown promise in medical image semantic segmentation (MISS) by capturing long-range correlations. However, ViTs often struggle to model local spatial information effectively, which is essential for accurately segmenting fine anatomical details, particularly when applied to small datasets without extensive pre-training. We introduce Gabor and Laplacian of Gaussian Convolutional Swin Network (GLoG-CSUnet), a novel architecture enhancing Transformer-based models by incorporating learnable radiomic features. This approach integrates dynamically adaptive Gabor and Laplacian of Gaussian (LoG) filters to capture texture, edge, and boundary information, enhancing the feature representation processed by the Transformer model. Our method uniquely combines the long-range dependency modeling of Transformers with the texture analysis capabilities of Gabor and LoG features. Evaluated on the Synapse multi-organ and ACDC cardiac segmentation datasets, GLoG-CSUnet demonstrates significant improvements over state-of-the-art models, achieving a 1.14% increase in Dice score for Synapse and 0.99% for ACDC, with minimal computational overhead (only 15 and 30 additional parameters, respectively). GLoG-CSUnet's flexible design allows integration with various base models, offering a promising approach for incorporating radiomics-inspired feature extraction in Transformer architectures for medical image analysis. The code implementation is available on GitHub at: https://github.com/HAAIL/GLoG-CSUnet.
中文: GLoG-CSUnet架构通过整合自适应Gabor和拉普拉斯高斯滤波器来增强视觉Transformer在医学图像分割中的表现,能有效捕捉细微纹理特征,在基准数据集上以极小的计算开销实现了性能突破。
English: The GLoG-CSUnet architecture enhances Vision Transformers for medical image segmentation by integrating adaptive Gabor and Laplacian of Gaussian filters to capture fine texture details, achieving superior performance on benchmark datasets with minimal computational overhead.
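To make the two filter families named in the GLoG-CSUnet abstract more concrete, the following sketch samples a Laplacian of Gaussian kernel and the real part of a Gabor kernel on a small grid using their textbook formulas. In the paper the filter parameters are described as learnable; here they are plain floats, and the function names, argument names, and grid sizes are illustrative assumptions rather than the authors' implementation.

```python
import math


def log_kernel(size, sigma):
    """Laplacian of Gaussian on a size x size grid (size assumed odd), centred at 0."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = (x * x + y * y) / (2.0 * sigma ** 2)
            row.append(-(1.0 - r2) * math.exp(-r2) / (math.pi * sigma ** 4))
        kernel.append(row)
    return kernel


def gabor_kernel(size, sigma, theta, wavelength, psi=0.0, aspect=1.0):
    """Real Gabor filter: Gaussian envelope times a cosine carrier oriented at theta."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the carrier oscillates along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (aspect * yr) ** 2) / (2.0 * sigma ** 2))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength + psi))
        kernel.append(row)
    return kernel


log5 = log_kernel(5, sigma=1.0)
gab5 = gabor_kernel(5, sigma=2.0, theta=0.0, wavelength=4.0)
```

Each kernel is controlled by only a handful of scalars (sigma, theta, wavelength, phase, aspect ratio), which is consistent with the very small number of additional parameters the abstract reports.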

Authors:Binyu Zhang, Zhu Meng, Junhao Dong, Fei Su, Zhicheng Zhao
Title: ICFNet: Integrated Cross-modal Fusion Network for Survival Prediction
Abstract:
Survival prediction is a crucial task in the medical field and is essential for optimizing treatment options and resource allocation. However, current methods often rely on limited data modalities, resulting in suboptimal performance. In this paper, we propose an Integrated Cross-modal Fusion Network (ICFNet) that integrates histopathology whole slide images, genomic expression profiles, patient demographics, and treatment protocols. Specifically, three types of encoders, a residual orthogonal decomposition module and a unification fusion module are employed to merge multi-modal features to enhance prediction accuracy. Additionally, a balanced negative log-likelihood loss function is designed to ensure fair training across different patients. Extensive experiments demonstrate that our ICFNet outperforms state-of-the-art algorithms on five public TCGA datasets, including BLCA, BRCA, GBMLGG, LUAD, and UCEC, and shows its potential to support clinical decision-making and advance precision medicine. The codes are available at: https://github.com/binging512/ICFNet.
中文: 本文提出的ICFNet通过整合病理图像、基因组数据和临床信息的多模态融合网络,在五种癌症数据集上显著提升了生存预测准确率,展现出良好的临床应用前景。
English: This paper introduces ICFNet, a novel multi-modal fusion network that integrates histopathology images, genomic data, and clinical information to significantly improve survival prediction accuracy across five cancer datasets, demonstrating strong potential for clinical application.

Authors:Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang
Title: GeAR: Generation Augmented Retrieval
Abstract:
Document retrieval techniques are essential for developing large-scale information systems. The common approach involves using a bi-encoder to compute the semantic similarity between a query and documents. However, the scalar similarity often fails to reflect enough information, hindering the interpretation of retrieval results. In addition, this process primarily focuses on global semantics, overlooking the finer-grained semantic relationships between the query and the document's content. In this paper, we introduce a novel method, Generation Augmented Retrieval (GeAR), which not only improves the global document-query similarity through contrastive learning, but also integrates well-designed fusion and decoding modules. This enables GeAR to generate relevant context within the documents based on a given query, facilitating learning to retrieve local fine-grained information. Furthermore, when used as a retriever, GeAR does not incur any additional computational cost over bi-encoders. GeAR exhibits competitive retrieval performance across diverse scenarios and tasks. Moreover, qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released at https://github.com/microsoft/LMOps.
Chinese: 本文提出GeAR这一新颖文档检索方法,通过对比学习提升全局语义相似度,并生成细粒度上下文信息而不增加计算成本,在多种任务中展现出竞争优势。
English: This paper introduces GeAR, a novel document retrieval method that enhances global semantic similarity through contrastive learning and generates fine-grained contextual information without extra computational cost, demonstrating competitive performance across various tasks.

Authors:Yifan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong
Title: Visual Large Language Models for Generalized and Specialized Applications
Abstract:
Visual-language models (VLM) have emerged as a powerful tool for learning a unified embedding space for vision and language. Inspired by large language models, which have demonstrated strong reasoning and multi-task capabilities, visual large language models (VLLMs) are gaining increasing attention for building general-purpose VLMs. Despite the significant progress made in VLLMs, the related literature remains limited, particularly from a comprehensive application perspective, encompassing generalized and specialized applications across vision (image, video, depth), action, and language modalities. In this survey, we focus on the diverse applications of VLLMs, examining their usage scenarios, identifying ethical considerations and challenges, and discussing future directions for their development. By synthesizing these contents, we aim to provide a comprehensive guide that will pave the way for future innovations and broader applications of VLLMs. The paper list repository is available at https://github.com/JackYFL/awesome-VLLMs.
Chinese: 本综述全面探讨了视觉大语言模型(VLLMs)在多模态领域的广泛应用,分析其使用场景、伦理考量、挑战及未来发展方向,旨在为后续创新提供系统指导。
English: This survey comprehensively examines the diverse applications of visual large language models (VLLMs) across multiple modalities, addressing their usage scenarios, ethical considerations, challenges, and future directions to guide further innovation.

Authors:Xiaojiao Guo, Xuhang Chen, Shuqiang Wang, Chi-Man Pun
Title: Underwater Image Restoration Through a Prior Guided Hybrid Sense Approach and Extensive Benchmark Analysis
Abstract:
Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior Guided Hybrid Sense Underwater Image Restoration framework (GuidedHybSensUIR). This framework operates on multiple scales, employing the proposed Detail Restorer module to restore low-level detailed features at finer scales and utilizing the proposed Feature Contextualizer module to capture long-range contextual relations of high-level general features at a broader scale. The hybridization of these different scales of sensing results effectively addresses color casts and restores blurry details. In order to effectively point out the evolutionary direction for the model, we propose a novel Color Balance Prior as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets, including three paired and three unpaired, sourced from four real-world underwater datasets. Subsequently, we tested 14 traditional methods and retrained 23 existing deep learning underwater image restoration methods on this benchmark, obtaining metric results for each approach. This effort aims to furnish a valuable benchmarking dataset as a standard basis for comparison. The extensive experiment results demonstrate that our method outperforms 37 other state-of-the-art methods overall on various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at https://github.com/CXH-Research/GuidedHybSensUIR.
Chinese: 提出的GuidedHybSensUIR框架通过结合多尺度细节恢复和新型色彩平衡先验,有效修复水下图像,在多个基准测试中优于37种现有最优方法。
English: The proposed GuidedHybSensUIR framework effectively restores underwater images by combining multi-scale detail restoration with a novel Color Balance Prior, outperforming 37 state-of-the-art methods across various benchmarks.

Authors:Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, Bhavya Kailkhura, Meijun Gao, Tianlong Chen, Kaixiong Zhou
Title: Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
Abstract:
As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, threaten LLMs' safety significantly. In this paper, we introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks by utilizing an unlearning strategy to patch specific layers within LLMs through self-augmented datasets. Our insight is that certain layers tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and adversarially exposing them to generate more harmful data, one can understand their inherent and diverse vulnerabilities to attacks. With these exposures, we then "unlearn" these issues, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model's responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak attacks to demonstrate the efficacy of our approach. Results indicate that our framework reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility for benign queries compared to recent defense methods. Our code is publicly available at: https://github.com/oyy2000/LayerAdvPatcher
中文: Layer-AdvPatcher通过逆向学习修补大语言模型中的脆弱层,在保持良性查询功能的同时显著降低了越狱攻击的成功率和危害性。
English: Layer-AdvPatcher is a defense method that patches vulnerable layers in LLMs through adversarial unlearning, effectively reducing jailbreak risks while maintaining model utility for safe queries.

Authors:Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, Dan Alistarh
Title: HALO: Hadamard-Assisted Lower-Precision Optimization for LLMs
Abstract:
Quantized training of Large Language Models (LLMs) remains an open challenge, as maintaining accuracy while performing all matrix multiplications in low precision has proven difficult. This is particularly the case when fine-tuning pre-trained models, which can have large weight and activation outlier values that make lower-precision optimization difficult. To address this, we present HALO, a novel quantization-aware training approach for Transformers that enables accurate and efficient low-precision training by combining 1) strategic placement of Hadamard rotations in both forward and backward passes, which mitigate outliers, 2) high-performance kernel support, and 3) FSDP integration for low-precision communication. Our approach ensures that all large matrix multiplications during the forward and backward passes are executed in lower precision. Applied to LLAMA-family models, HALO achieves near-full-precision-equivalent results during fine-tuning on various tasks, while delivering up to 1.41x end-to-end speedup for full fine-tuning on RTX 4090 GPUs. HALO efficiently supports both standard and parameter-efficient fine-tuning (PEFT). Our results demonstrate the first practical approach to fully quantized LLM fine-tuning that maintains accuracy in 8-bit precision, while delivering performance benefits. Code is available at https://github.com/IST-DASLab/HALO.
中文: HALO提出了一种新颖的Transformer量化感知训练方法,通过巧妙运用哈达玛旋转和高性能内核,实现了LLMs在低精度下的精确微调,在保持接近全精度结果的同时获得最高1.41倍的加速效果。
English: HALO introduces a novel quantization-aware training method for Transformers that enables accurate low-precision fine-tuning of LLMs by strategically using Hadamard rotations and high-performance kernels, achieving near-full-precision results with up to 1.41x speedup.
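The HALO abstract credits Hadamard rotations with mitigating outliers before low-precision arithmetic. The toy sketch below illustrates that general effect only: it rotates a vector containing one large outlier with an orthonormal Hadamard matrix, applies a simple symmetric int8 quantize-dequantize step, and rotates back. The quantizer, the example vector, and all function names are assumptions chosen for illustration; this is not HALO's kernel design or its placement strategy.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(x):
    """Apply the orthonormal rotation H / sqrt(n); this matrix is its own inverse."""
    n = len(x)
    h = hadamard(n)
    return [sum(h[i][j] * x[j] for j in range(n)) / math.sqrt(n) for i in range(n)]


def fake_quant_int8(x):
    """Symmetric per-tensor int8 quantize-dequantize using an absmax scale."""
    scale = max(abs(v) for v in x) / 127.0
    return [round(v / scale) * scale for v in x]


def sq_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


# One large outlier forces a coarse scale that wipes out the small entries.
x = [64.0, 0.3, -0.2, 0.1, 0.25, -0.15, 0.05, -0.3]

err_plain = sq_error(x, fake_quant_int8(x))
# Rotate, quantize in the rotated basis, then rotate back before measuring error.
err_rot = sq_error(x, rotate(fake_quant_int8(rotate(x))))
```

The rotation spreads the outlier's energy evenly over all coordinates, so the absmax scale shrinks and the quantization grid becomes finer for every entry.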

Authors:Jiaping Wang, Simiao Zhang, Qiao-Chu He, Yifan Chen
Title: LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations
Abstract:
The machine learning and data science community has made significant but scattered progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with exponentially decaying causal linear attention. In this paper, we present LeetDecoding, which is the first Python package that provides a large set of computation routines for this fundamental operator. The launch of LeetDecoding was motivated by the current lack of (1) clear understanding of the complexity regarding this operator, (2) a comprehensive collection of existing computation methods (usually spread in seemingly unrelated fields), and (3) CUDA implementations for fast inference on GPU. LeetDecoding's design is easy to integrate with existing linear-attention LLMs, and allows researchers to benchmark and evaluate new computation methods for exponentially decaying causal linear attention. The usage of LeetDecoding does not require any knowledge of GPU programming and the underlying complexity analysis, intentionally making LeetDecoding accessible to LLM practitioners. The source code of LeetDecoding is provided at https://github.com/Computational-Machine-Intelligence/LeetDecoding, and users can install LeetDecoding with the command pip install leet-decoding.
中文: LeetDecoding是首个提供指数衰减因果线性注意力计算功能的Python工具包,旨在解决该算子理解不足、方法分散和CUDA实现缺失的问题,支持无缝集成与性能评估,且无需GPU编程知识即可使用。
English: LeetDecoding is the first Python package offering comprehensive computation routines for exponentially decaying causal linear attention in transformers, addressing the lack of understanding, method collection, and CUDA implementations while enabling easy integration and benchmarking without requiring GPU programming expertise.
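For readers unfamiliar with the operator that LeetDecoding targets, here is a minimal pure-Python sketch, assuming the common definition o_t = sum over s <= t of gamma^(t-s) (q_t . k_s) v_s. It computes the output twice, once from the quadratic definition and once with an equivalent linear-time recurrence. The function names and the `gamma` argument are hypothetical and do not reflect the LeetDecoding API.

```python
def decayed_attention_quadratic(q, k, v, gamma):
    """Direct O(T^2) evaluation of o_t = sum_{s<=t} gamma^(t-s) (q_t . k_s) v_s."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        o = [0.0] * d_v
        for s in range(t + 1):  # causal: only positions up to t contribute
            score = gamma ** (t - s) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                o[j] += score * v[s][j]
        out.append(o)
    return out


def decayed_attention_recurrent(q, k, v, gamma):
    """O(T) recurrence: S_t = gamma * S_{t-1} + k_t v_t^T, then o_t = q_t^T S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running d_k x d_v summary
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]

quad = decayed_attention_quadratic(q, k, v, gamma=0.9)
rec = decayed_attention_recurrent(q, k, v, gamma=0.9)
```

Both functions return the same values up to floating-point rounding; the recurrent form keeps only a fixed-size state, which is what makes this operator attractive for fast decoding.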

Authors:Lin Wang, Qing Li
Title: Efficient Graph Condensation via Gaussian Process
Abstract:
Graph condensation reduces the size of large graphs while preserving performance, addressing the scalability challenges of Graph Neural Networks caused by computational inefficiencies on large datasets. Existing methods often rely on bi-level optimization, requiring extensive GNN training and limiting their scalability. To address these issues, this paper proposes Graph Condensation via Gaussian Process (GCGP), a novel and computationally efficient approach to graph condensation. GCGP utilizes a Gaussian Process (GP), with the condensed graph serving as observations, to estimate the posterior distribution of predictions. This approach eliminates the need for the iterative and resource-intensive training typically required by GNNs. To enhance the capability of the GCGP in capturing dependencies between function values, we derive a specialized covariance function that incorporates structural information. This covariance function broadens the receptive field of input nodes by local neighborhood aggregation, thereby facilitating the representation of intricate dependencies within the nodes. To address the challenge of optimizing binary structural information in condensed graphs, Concrete random variables are utilized to approximate the binary adjacency matrix in a continuous counterpart. This relaxation process allows the adjacency matrix to be represented in a differentiable form, enabling the application of gradient-based optimization techniques to discrete graph structures. Experimental results show that the proposed GCGP method efficiently condenses large-scale graph data while preserving predictive performance, addressing the scalability and efficiency challenges. The implementation of our method is publicly available at https://github.com/WANGLin0126/GCGP.
中文摘要:本文提出基于高斯过程的图压缩方法(GCGP),通过高斯过程替代传统图神经网络训练,利用定制协方差函数和梯度优化技术,在保持预测性能的同时高效压缩大规模图数据,解决了可扩展性难题。
English Summary: This paper introduces Graph Condensation via Gaussian Process (GCGP), a novel method that efficiently reduces graph size using Gaussian Processes to bypass intensive GNN training, incorporating specialized covariance functions and gradient optimization to maintain performance while addressing scalability.
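The GCGP abstract describes relaxing the binary adjacency matrix with Concrete random variables so that gradients can flow through it. The sketch below shows the standard binary Concrete (relaxed Bernoulli) sample, sigmoid((logit + logistic noise) / temperature), applied edge by edge. Treating the graph as undirected with no self-loops, and all function and parameter names, are assumptions made for this illustration and are not taken from the GCGP code.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_concrete(logit, temperature, rng):
    """One relaxed Bernoulli sample in [0, 1]; lower temperature pushes it toward 0 or 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
    noise = math.log(u) - math.log(1.0 - u)          # Logistic(0, 1) noise
    return sigmoid((logit + noise) / temperature)


def sample_relaxed_adjacency(logits, temperature=0.5, seed=0):
    """Sample a symmetric, zero-diagonal soft adjacency matrix from edge logits."""
    rng = random.Random(seed)
    n = len(logits)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = adj[j][i] = binary_concrete(logits[i][j], temperature, rng)
    return adj


# Hypothetical edge logits for a three-node condensed graph.
edge_logits = [[0.0, 2.0, -2.0],
               [2.0, 0.0, 0.5],
               [-2.0, 0.5, 0.0]]
soft_adj = sample_relaxed_adjacency(edge_logits, temperature=0.5, seed=0)
```

Because each entry is a smooth function of its logit, an automatic differentiation framework could backpropagate a loss through the sampled matrix to the logits, which is the property the relaxation is used for.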

Authors:Yibo Zhang
Title: KM-UNet KAN Mamba UNet for medical image segmentation
Abstract:
Medical image segmentation is a critical task in medical imaging analysis. Traditional CNN-based methods struggle with modeling long-range dependencies, while Transformer-based models, despite their success, suffer from quadratic computational complexity. To address these limitations, we propose KM-UNet, a novel U-shaped network architecture that combines the strengths of Kolmogorov-Arnold Networks (KANs) and state-space models (SSMs). KM-UNet leverages the Kolmogorov-Arnold representation theorem for efficient feature representation and SSMs for scalable long-range modeling, achieving a balance between accuracy and computational efficiency. We evaluate KM-UNet on five benchmark datasets: ISIC17, ISIC18, CVC, BUSI, and GLAS. Experimental results demonstrate that KM-UNet achieves competitive performance compared to state-of-the-art methods in medical image segmentation tasks. To the best of our knowledge, KM-UNet is the first medical image segmentation framework integrating KANs and SSMs. This work provides a valuable baseline and new insights for the development of more efficient and interpretable medical image segmentation systems. The code is open source at https://github.com/2760613195/KM_UNet. Keywords: KAN, Mamba, state-space models, UNet, medical image segmentation, deep learning
中文:KM-UNet是一种新颖的U型网络,结合Kolmogorov-Arnold网络与状态空间模型,在多个基准数据集上实现了高效精准的医学图像分割,展现出卓越性能。
English: KM-UNet is a novel U-shaped network that integrates Kolmogorov-Arnold Networks and state-space models to achieve efficient and accurate medical image segmentation, demonstrating competitive performance on multiple benchmark datasets.

Authors:Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi
Title: Multi-LLM Collaborative Caption Generation in Scientific Documents
Abstract:
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
中文摘要:MLBCAP框架通过协同多个专用大语言模型,实现了训练数据质量评估、多样化描述生成与最优描述筛选的三步优化,在科学图表标注任务中生成比人工撰写更优质的结果。
English Summary: The MLBCAP framework enhances scientific figure captioning by employing multiple specialized LLMs to filter low-quality data, generate diverse candidate captions, and select the most accurate description, outperforming human-written captions in evaluations.

Authors:Haichao Liu, Kai Chen, Yulin Li, Zhenmin Huang, Ming Liu, Jun Ma
Title: UDMC: Unified Decision-Making and Control Framework for Urban Autonomous Driving with Motion Prediction of Traffic Participants
Abstract:
Current autonomous driving systems often struggle to balance decision-making and motion control while ensuring safety and traffic rule compliance, especially in complex urban environments. Existing methods may fall short due to separate handling of these functionalities, leading to inefficiencies and safety compromises. To address these challenges, we introduce UDMC, an interpretable and unified Level 4 autonomous driving framework. UDMC integrates decision-making and motion control into a single optimal control problem (OCP), considering the dynamic interactions with surrounding vehicles, pedestrians, road lanes, and traffic signals. By employing innovative potential functions to model traffic participants and regulations, and incorporating a specialized motion prediction module, our framework enhances on-road safety and rule adherence. The integrated design allows for real-time execution of flexible maneuvers suited to diverse driving scenarios. High-fidelity simulations conducted in CARLA exemplify the framework's computational efficiency, robustness, and safety, resulting in superior driving performance when compared against various baseline models. Our open-source project is available at https://github.com/henryhcliu/udmc_carla.git.
中文摘要:UDMC框架将决策与运动控制整合为统一的最优控制问题,通过实时自适应操作提升复杂城市场景中的安全性和交通规则遵守能力。
English Summary: The UDMC framework integrates decision-making and motion control into a unified optimal control problem, enhancing safety and traffic rule compliance through real-time adaptable maneuvers in complex urban environments.

Authors:Yaohui Wang, Zicong Wang, Fanfeng Meng, Yanjing Wang, Yang Ou, Lizhou Wu, Wentao Hong, Xuran Ge, Jijun Cao
Title: A Full-System Simulation Framework for CXL-Based SSD Memory System
Abstract:
Compute Express Link (CXL) is a promising technology for memory disaggregation and expansion. In particular, CXL makes it more effective to deploy large-capacity storage devices such as Solid State Drives (SSDs) in the memory pool. However, CXL-based SSDs are still in their early stages, necessitating the development of reliable simulation tools. In this paper, we propose CXL-SSD-Sim, the first open-source full-system simulator designed to simulate CXL-based SSD memory systems. Built on gem5 and SimpleSSD, CXL-SSD-Sim adds a high-fidelity SSD memory expander model along with the corresponding device driver. In addition, CXL-SSD-Sim models a DRAM layer as a cache for the SSD, designed to counteract the latency inherent in CXL-based SSD memory access. Experiments are performed on five different memory devices with CXL-SSD-Sim in terms of latency, bandwidth, and real-world benchmark performance. These experiments underscore the efficacy of our simulation tool in providing a comprehensive analysis of CXL-based SSD memory systems. The CXL-SSD-Sim simulator is available at https://github.com/WangYaohuii/CXL-SSD-Sim.
Chinese: 本文提出了首个开源全系统模拟器CXL-SSD-Sim,用于模拟基于CXL的SSD内存系统,通过DRAM缓存层缓解访问延迟,并通过实验验证了其全面的性能分析能力。
English: This paper introduces CXL-SSD-Sim, the first open-source full-system simulator for CXL-based SSD memory systems, which models a DRAM cache to mitigate latency and enables comprehensive performance analysis through experiments.

Authors:Dawei Dai, Mingming Jia, Yinxiu Zhou, Hang Xing, Chenghang Li
Title: Face-MakeUp: Multimodal Facial Prompts for Text-to-Image Generation
Abstract:
Facial images have extensive practical applications. Although current large-scale text-to-image diffusion models exhibit strong generation capabilities, it is challenging to generate the desired facial images using only text prompts. Image prompts are a logical choice. However, current methods of this type generally focus on the general domain. In this paper, we aim to optimize image makeup techniques to generate the desired facial images. Specifically, (1) we build a dataset of 4 million high-quality face image-text pairs (FaceCaptionHQ-4M) based on LAION-Face to train our Face-MakeUp model; (2) to maintain consistency with the reference facial image, we extract/learn multi-scale content features and pose features from the facial image and integrate them into the diffusion model to better preserve facial identity features. Validation on two face-related test datasets demonstrates that our Face-MakeUp achieves the best comprehensive performance. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/Face-MakeUp
中文: 本文提出Face-MakeUp模型,通过提取参考图像的多尺度内容和姿态特征并融入扩散模型,基于新建的400万高质量人脸图像-文本数据集进行训练,有效提升生成面部图像的身份一致性,在测试中表现最佳。
English: This paper introduces Face-MakeUp, a model that enhances facial image generation by integrating multi-scale content and pose features from reference images, trained on a new dataset of 4 million face image-text pairs to improve identity consistency and achieve top performance.

Authors:Sung Jin Um, Dongjin Kim, Sangmin Lee, Jung Uk Kim
Title: Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection
Abstract:
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query. With the rapid growth of video content and the overlap between these tasks, recent works have addressed both simultaneously. However, they still struggle to fully capture the overall video context, making it challenging to determine which words are most relevant. In this paper, we present a novel Video Context-aware Keyword Attention module that overcomes this limitation by capturing keyword variation within the context of the entire video. To achieve this, we introduce a video context clustering module that provides concise representations of the overall video context, thereby enhancing the understanding of keyword dynamics. Furthermore, we propose a keyword weight detection module with keyword-aware contrastive learning that incorporates keyword information to enhance fine-grained alignment between visual and textual features. Extensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate that our proposed method significantly improves performance in moment retrieval and highlight detection tasks compared to existing approaches. Our code is available at: https://github.com/VisualAIKHU/Keyword-DETR
中文摘要:本文提出了一种视频上下文感知关键词注意力模块,通过捕捉关键词在整体视频语境中的变化并利用上下文聚类和对比学习增强视觉与文本特征的细粒度对齐,在多个基准测试中显著提升了视频片段检索和高亮检测的性能。
English Summary: This paper introduces a Video Context-aware Keyword Attention module that enhances video moment retrieval and highlight detection by capturing keyword variations and improving visual-text alignment through context clustering and contrastive learning, achieving superior performance on benchmark datasets.

Authors:Zhe Chen, Yusheng Liao, Shuyang Jiang, Pingjie Wang, Yiqiu Guo, Yanfeng Wang, Yu Wang
Title: Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications
Abstract:
Large language models hold promise for addressing medical challenges, such as medical diagnosis reasoning, research knowledge acquisition, clinical decision-making, and consumer health inquiry support. However, they often generate hallucinations due to limited medical knowledge. Incorporating external knowledge is therefore critical, which necessitates multi-source knowledge acquisition. We address this challenge by framing it as a source planning problem, which is to formulate context-appropriate queries tailored to the attributes of diverse sources. Existing approaches either overlook source planning or fail to achieve it effectively due to misalignment between the model's expectation of the sources and their actual content. To bridge this gap, we present MedOmniKB, a repository comprising multigenre and multi-structured medical knowledge sources. Leveraging these sources, we propose the Source Planning Optimisation method, which enhances multi-source utilisation. Our approach involves enabling an expert model to explore and evaluate potential plans while training a smaller model to learn source alignment. Experimental results demonstrate that our method substantially improves multi-source planning performance, enabling the optimised small model to achieve state-of-the-art results in leveraging diverse medical knowledge sources.
中文摘要:大语言模型虽能处理医疗任务,但因知识有限常产生错误,为此我们构建了MedOmniKB知识库并提出源规划优化方法,显著提升了多源知识利用效率,在医疗应用中取得了最优效果。
English Summary: Large language models can tackle medical tasks but often produce errors due to limited knowledge, so we developed MedOmniKB and a source planning optimization method to improve multi-source knowledge integration, achieving top performance in medical applications.

Authors:Yihang Tao, Senkang Hu, Yue Hu, Haonan An, Hangcheng Cao, Yuguang Fang
Title: GCP: Guarded Collaborative Perception with Spatial-Temporal Aware Malicious Agent Detection
Abstract:
Collaborative perception significantly enhances autonomous driving safety by extending each vehicle's perception range through message sharing among connected and autonomous vehicles. Unfortunately, it is also vulnerable to adversarial message attacks from malicious agents, resulting in severe performance degradation. While existing defenses employ hypothesis-and-verification frameworks to detect malicious agents based on single-shot outliers, they overlook temporal message correlations and can thus be circumvented by subtle yet harmful perturbations in model input and output spaces. This paper reveals a novel blind area confusion (BAC) attack that compromises existing single-shot outlier-based detection methods. As a countermeasure, we propose GCP, a Guarded Collaborative Perception framework based on spatial-temporal aware malicious agent detection, which maintains single-shot spatial consistency through a confidence-scaled spatial concordance loss, while simultaneously examining temporal anomalies by reconstructing historical bird's eye view motion flows in low-confidence regions. We also employ a joint spatial-temporal Benjamini-Hochberg test to synthesize dual-domain anomaly results for reliable malicious agent detection. Extensive experiments demonstrate GCP's superior performance under diverse attack scenarios, achieving up to 34.69% improvements in AP@0.5 compared to the state-of-the-art CP defense strategies under BAC attacks, while maintaining consistent 5-8% improvements under other typical attacks. Code will be released at https://github.com/CP-Security/GCP.git.
中文: 本文揭示了一种新型盲区混淆攻击对协同感知系统的威胁,并提出基于时空检测的GCP防御框架,该框架在多种攻击场景下均展现出优越性能,检测精度提升最高达34.69%。
English: This paper introduces a novel blind area confusion attack that exploits vulnerabilities in collaborative perception systems and proposes GCP, a guarded framework using spatial-temporal detection to effectively counter such attacks with significant performance improvements.
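
The Benjamini-Hochberg test named in the abstract is a standard multiple-testing procedure, sketched below in its textbook form. How GCP derives and combines spatial and temporal p-values is not described here, so the input list of per-agent p-values and the function name `benjamini_hochberg` are assumptions for illustration only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected.

    Standard step-up procedure controlling the false discovery rate at alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject every hypothesis whose sorted rank is at most the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected
```

In a detection setting, each entry could correspond to one collaborating agent, with a rejected hypothesis flagging that agent as suspicious.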

Authors:Binh-Nguyen Nguyen, Yang He
Title: Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding
Abstract:
Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources. Source code is available at: https://github.com/he-y/NLP-Dataset-Pruning
中文: 本文提出Swift跨数据集剪枝方法,利用TF-IDF嵌入和几何中位数快速评估样本重要性,根据数据集规模自适应调整剪枝策略以保持多样性,在显著降低计算资源的同时在六个不同数据集上验证了有效性。
English: This paper introduces Swift Cross-Dataset Pruning (SCDP), a method that uses TF-IDF embeddings and geometric median calculations to efficiently prune datasets for task-specific fine-tuning, adapting strategies based on dataset size to maintain diversity while reducing computational costs.
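
The small-dataset branch described in the abstract (score samples by distance from the geometric median of their TF-IDF embeddings and keep the farthest ones) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the SCDP code: the whitespace tokenizer, the smoothed IDF variant, the Weiszfeld iteration, and the names `tfidf_vectors`, `geometric_median`, and `keep_farthest` are all choices made here. The stratified pruning used for larger datasets is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Dense TF-IDF vectors (smoothed IDF) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        vectors.append([tf[tok] / length * idf[tok] for tok in vocab])
    return vectors


def geometric_median(points, iterations=100, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's algorithm."""
    dim = len(points[0])
    median = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iterations):
        # Weight each point by the inverse of its distance to the estimate.
        weights = [1.0 / max(math.dist(p, median), eps) for p in points]
        total = sum(weights)
        median = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ]
    return median


def keep_farthest(docs, keep_ratio=0.5):
    """Return sorted indices of the samples farthest from the geometric median."""
    vectors = tfidf_vectors(docs)
    median = geometric_median(vectors)
    distances = [math.dist(v, median) for v in vectors]
    k = max(1, round(keep_ratio * len(docs)))
    ranked = sorted(range(len(docs)), key=lambda i: distances[i], reverse=True)
    return sorted(ranked[:k])
```

No model training is involved at any step, which is the source of the speed advantage the abstract claims over ranking methods that need full-dataset training or reference models.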

Authors:Tara Radvand, Mojtaba Abdolmaleki, Mohamed Mostagir, Ambuj Tewari
Title: Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities
Abstract:
Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs $A$ (non-sanctioned) and $B$ (in-house), and (ii) identify whether text was generated by a known LLM or generated by any unknown model, e.g., a human or some other language generation process. We prove that the type I and type II errors of our test decrease exponentially with the length of the text. For that, we show that if $B$ generates the text, then except with an exponentially small probability in string length, the log-perplexity of the string under $A$ converges to the average cross-entropy of $B$ and $A$. We then present experiments using LLMs with white-box access to support our theoretical results and empirically examine the robustness of our results to black-box settings and adversarial attacks. In the black-box setting, our method achieves an average TPR of 82.5\% at a fixed FPR of 5\%. Under adversarial perturbations, our minimum TPR is 48.6\% at the same FPR threshold. Both results outperform all non-commercial baselines. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.
Chinese: 本文提出了一种通过将文本建模为随机过程并使用零样本统计测试来验证其是否由特定大型语言模型生成的方法,该方法在理论和实验中均表现出色,错误率随文本长度呈指数级下降,并在白盒和黑盒环境下均优于现有基线。
English: This paper introduces a method to verify whether text is generated by a specific large language model by modeling it as a stochastic process and using zero-shot statistical tests, with proven exponential error reduction and strong performance in both white-box and black-box settings.
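
The key quantity in the abstract is the log-perplexity of a string under model $A$, which concentrates around the entropy of $A$ when $A$ wrote the text and around the cross-entropy of $B$ relative to $A$ when $B$ wrote it. The toy sketch below illustrates that idea only. It assumes context-independent token distributions (the paper models full dependence on history) and a nearest-value decision rule in place of the paper's concentration-inequality thresholds; all function names are hypothetical.

```python
import math


def log_perplexity(tokens, model):
    """Average negative log-probability of the tokens under `model`."""
    return -sum(math.log(model[t]) for t in tokens) / len(tokens)


def cross_entropy(source, scorer):
    """Expected negative log-probability under `scorer` for tokens drawn from `source`."""
    return -sum(p * math.log(scorer[t]) for t, p in source.items() if p > 0)


def attribute(tokens, model_a, model_b):
    """Return 'A' or 'B' by comparing the observed log-perplexity under A
    with the value expected if A wrote the text versus if B wrote it."""
    observed = log_perplexity(tokens, model_a)
    expected_if_a = cross_entropy(model_a, model_a)  # entropy of A
    expected_if_b = cross_entropy(model_b, model_a)  # cross-entropy of B relative to A
    if abs(observed - expected_if_a) <= abs(observed - expected_if_b):
        return "A"
    return "B"
```

With longer texts the observed value sits closer to its expectation, which is the intuition behind error rates that shrink exponentially in text length.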

Authors:Sichao Wang, Ming Yuan, Chuang Zhang, Qing Xu, Lei He, Jianqiang Wang
Title: V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection
Abstract:
In V2X collaborative perception, the domain gaps between heterogeneous nodes pose a significant challenge for effective information fusion. Pose errors arising from latency and GPS localization noise further exacerbate the issue by leading to feature misalignment. To overcome these challenges, we propose V2X-DGPE, a high-accuracy and robust V2X feature-level collaborative perception framework. V2X-DGPE employs a Knowledge Distillation Framework and a Feature Compensation Module to learn domain-invariant representations from multi-source data, effectively reducing the feature distribution gap between vehicles and roadside infrastructure. Historical information is utilized to provide the model with a more comprehensive understanding of the current scene. Furthermore, a Collaborative Fusion Module leverages a heterogeneous self-attention mechanism to extract and integrate heterogeneous representations from vehicles and infrastructure. To address pose errors, V2X-DGPE introduces a deformable attention mechanism, enabling the model to adaptively focus on critical parts of the input features by dynamically offsetting sampling points. Extensive experiments on the real-world DAIR-V2X dataset demonstrate that the proposed method outperforms existing approaches, achieving state-of-the-art detection performance. The code is available at https://github.com/wangsch10/V2X-DGPE.
中文:提出的V2X-DGPE框架通过知识蒸馏、特征补偿和可变形注意力机制解决V2X协同感知中的域差异和位姿误差问题,在真实场景数据集中实现了最优性能。
English: The proposed V2X-DGPE framework addresses domain gaps and pose errors in V2X collaborative perception through knowledge distillation, feature compensation, and deformable attention mechanisms, achieving state-of-the-art performance on real-world datasets.

Authors:Ambroise Odonnat, Wassim Bouaziz, Vivien Cabannes
Title: Easing Optimization Paths: a Circuit Perspective
Abstract:
Gradient descent is the method of choice for training large artificial intelligence systems. As these systems become larger, a better understanding of the mechanisms behind gradient training would allow us to alleviate compute costs and help steer these systems away from harmful behaviors. To that end, we suggest utilizing the circuit perspective brought forward by mechanistic interpretability. After laying out our intuition, we illustrate how it enables us to design a curriculum for efficient learning in a controlled setting. The code is available at \url{https://github.com/facebookresearch/pal}.
中文: 梯度下降是训练大型人工智能系统的关键方法,通过机制可解释性理解其机制可以降低计算成本并避免有害行为,正如在受控环境中设计的课程所展示的那样。
English: Gradient descent is essential for training large AI systems, and understanding its mechanisms through mechanistic interpretability can reduce computational costs and prevent harmful behaviors, as demonstrated by a designed curriculum in a controlled setting.

Authors:Yonglin Tian, Fei Lin, Yiduo Li, Tengchao Zhang, Qiyao Zhang, Xuan Fu, Jun Huang, Xingyuan Dai, Yutong Wang, Chunwei Tian, Bai Li, Yisheng Lv, Levente Kovács, Fei-Yue Wang
Title: UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility
Abstract:
Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains, like transportation, logistics, and agriculture. Leveraging flexible perspectives and rapid maneuverability, UAVs extend traditional systems' perception and action capabilities, garnering widespread attention from academia and industry. However, current UAV operations primarily depend on human control, with only limited autonomy in simple scenarios, and lack the intelligence and adaptability needed for more complex environments and tasks. The emergence of large language models (LLMs) demonstrates remarkable problem-solving and generalization capabilities, offering a promising pathway for advancing UAV intelligence. This paper explores the integration of LLMs and UAVs, beginning with an overview of UAV systems' fundamental components and functionalities, followed by an overview of the state-of-the-art in LLM technology. Subsequently, it systematically highlights the multimodal data resources available for UAVs, which provide critical support for training and evaluation. Furthermore, it categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge. Finally, a reference roadmap towards agentic UAVs is proposed, aiming to enable UAVs to achieve agentic intelligence through autonomous perception, memory, reasoning, and tool utilization. Related resources are available at https://github.com/Hub-Tian/UAVs_Meet_LLMs.
Chinese Summary: 大型语言模型与无人机的结合为提升无人机智能提供了可行路径,使其能够在复杂环境中实现自主感知、推理与任务执行。
English Summary: The integration of large language models (LLMs) with unmanned aerial vehicles (UAVs) offers a promising pathway to enhance UAV intelligence, enabling autonomous perception, reasoning, and task execution in complex environments.

Authors:Liye Jia, Runwei Guan, Haocheng Zhao, Qiuchi Zhao, Ka Lok Man, Jeremy Smith, Limin Yu, Yutao Yue
Title: RadarNeXt: Real-Time and Reliable 3D Object Detector Based On 4D mmWave Imaging Radar
Abstract:
3D object detection is crucial for Autonomous Driving (AD) and Advanced Driver Assistance Systems (ADAS). However, most 3D detectors prioritize detection accuracy, often overlooking network inference speed in practical applications. In this paper, we propose RadarNeXt, a real-time and reliable 3D object detector based on 4D mmWave radar point clouds. It leverages re-parameterizable neural networks to capture multi-scale features, reduce memory cost, and accelerate inference. Moreover, to highlight the irregular foreground features of radar point clouds and suppress background clutter, we propose a Multi-path Deformable Foreground Enhancement Network (MDFEN), which preserves detection accuracy while limiting the loss of speed and the growth in parameter count. Experimental results on the View-of-Delft and TJ4DRadSet datasets validate the exceptional performance and efficiency of RadarNeXt, which achieves 50.48 and 32.30 mAP, respectively, with the variant using our proposed MDFEN. Notably, our RadarNeXt variants achieve inference speeds of over 67.10 FPS on the RTX A4000 GPU and 28.40 FPS on the Jetson AGX Orin. This research demonstrates that RadarNeXt brings a novel and effective paradigm for 3D perception based on 4D mmWave radar.
中文: RadarNeXt是一种基于4D毫米波雷达的实时3D物体检测器,通过多尺度特征提取和前景增强网络在精度与速度间取得平衡,在自动驾驶数据集中表现卓越。
English: RadarNeXt is a real-time 3D object detector using 4D mmWave radar that balances accuracy and speed through multi-scale feature extraction and a foreground enhancement network, achieving high performance on autonomous driving datasets.

Authors:Zongwei Li, Lianghao Xia, Hua Hua, Shijie Zhang, Shuangyang Wang, Chao Huang
Title: DiffGraph: Heterogeneous Graph Diffusion Model
Abstract:
Recent advances in Graph Neural Networks (GNNs) have revolutionized graph-structured data modeling, yet traditional GNNs struggle with complex heterogeneous structures prevalent in real-world scenarios. Despite progress in handling heterogeneous interactions, two fundamental challenges persist: noisy data significantly compromising embedding quality and learning performance, and existing methods' inability to capture intricate semantic transitions among heterogeneous relations, which impacts downstream predictions. To address these fundamental issues, we present the Heterogeneous Graph Diffusion Model (DiffGraph), a pioneering framework that introduces an innovative cross-view denoising strategy. This advanced approach transforms auxiliary heterogeneous data into target semantic spaces, enabling precise distillation of task-relevant information. At its core, DiffGraph features a sophisticated latent heterogeneous graph diffusion mechanism, implementing a novel forward and backward diffusion process for superior noise management. This methodology achieves simultaneous heterogeneous graph denoising and cross-type transition, while significantly simplifying graph generation through its latent-space diffusion capabilities. Through rigorous experimental validation on both public and industrial datasets, we demonstrate that DiffGraph consistently surpasses existing methods in link prediction and node classification tasks, establishing new benchmarks for robustness and efficiency in heterogeneous graph processing. The model implementation is publicly available at: https://github.com/HKUDS/DiffGraph.
中文摘要:本文提出的异质图扩散模型通过跨视图去噪策略和潜在扩散机制,有效解决了异质图中噪声干扰和语义转换的难题,在多项图学习任务中实现了最优性能。
English Summary: The proposed Heterogeneous Graph Diffusion Model (DiffGraph) addresses noise and semantic transition challenges in heterogeneous graphs through cross-view denoising and latent diffusion mechanisms, achieving state-of-the-art performance in graph learning tasks.

Authors:Mengting Wei, Tuomas Varanka, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Title: MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
Abstract:
We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) of the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder to preserve facial details with high consistency. Specifically, to keep the facial details consistent with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistency, we introduce an efficient Attribute Controller by explicitly informing the model of the current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
中文: MagicFace是一种基于扩散模型的精细化面部表情编辑方法,通过控制动作单元变化并利用ID编码器和属性控制器,在保持身份、姿态和背景一致性的同时实现连续可解释的表情编辑。
English: MagicFace is a diffusion model that enables fine-grained facial expression editing by controlling action-unit variations while preserving identity, pose, and background through an ID encoder and attribute controller.

Authors:Mian Zou, Baosheng Yu, Yibing Zhan, Kede Ma
Title: Self-Supervised Learning for Detecting AI-Generated Faces as Anomalies
Abstract:
The detection of AI-generated faces is commonly approached as a binary classification task. Nevertheless, the resulting detectors frequently struggle to adapt to novel AI face generators, which evolve rapidly. In this paper, we describe an anomaly detection method for AI-generated faces by leveraging self-supervised learning of camera-intrinsic and face-specific features purely from photographic face images. The success of our method lies in designing a pretext task that trains a feature extractor to rank four ordinal exchangeable image file format (EXIF) tags and classify artificially manipulated face images. Subsequently, we model the learned feature distribution of photographic face images using a Gaussian mixture model. Faces with low likelihoods are flagged as AI-generated. Both quantitative and qualitative experiments validate the effectiveness of our method. Our code is available at \url{https://github.com/MZMMSEC/AIGFD_EXIF.git}.
中文: 本文提出一种基于自监督学习的异常检测方法,通过从摄影人脸图像中学习相机固有和面部特征,有效识别AI生成的人脸,实验验证了其有效性。
English: This paper introduces an anomaly detection method for AI-generated faces using self-supervised learning of camera and facial features from photographic images, validated by experiments to effectively identify synthetic faces.

Authors:Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, Guangyao Shi
Title: A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
Abstract:
Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification [93]. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs. Specifically, we provide a systematic overview of VLMs in the following aspects: [1] model information of the major VLMs developed up to 2025; [2] the transition of VLM architectures and the newest VLM alignment methods; [3] summary and categorization of the popular benchmarks and evaluation metrics of VLMs; [4] the challenges and issues faced by current VLMs such as hallucination, alignment, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Vision-Language-Models-Overview.
中文: 多模态视觉语言模型融合计算机视觉与自然语言处理,使机器能通过视觉和文本模态感知与推理,本综述系统梳理了其发展历程、架构、评估基准及幻觉与安全等挑战。
English: Multimodal Vision Language Models (VLMs) integrate computer vision and natural language processing to enable machines to perceive and reason through visual and textual data, with this survey systematically reviewing their development, architectures, benchmarks, and challenges like hallucination and safety.

Authors:Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Jianning Liu, Jun Zhou, Bingyan Liu
Title: A Separable Self-attention Inspired by the State Space Model for Computer Vision
Abstract:
Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive results on image classification and high-resolution dense prediction tasks. Code is available at: https://github.com/yws-wxs/VMINet.
中文: VMINet提出了一种新颖的可分离自注意力方法,首次将Mamba的优秀设计理念融入其中,在图像分类和高分辨率密集预测任务中取得了有竞争力的结果,其架构简单且不同于传统Transformer。
English: VMINet introduces a novel separable self-attention method that incorporates Mamba's design concepts, achieving competitive performance in image classification and dense prediction tasks with a simple architecture that differs significantly from the conventional Transformer.

Authors:Benjamin Shiue-Hal Chou, Purvish Jajal, Nicholas John Eliopoulos, Tim Nadolsky, Cheng-Yun Yang, Nikita Ravi, James C. Davis, Kristen Yeon-Ji Yun, Yung-Hsiang Lu
Title: Detecting Music Performance Errors with Transformers
Abstract:
Beginner musicians often struggle to identify specific errors in their performances, such as playing incorrect notes or rhythms. There are two limitations in existing tools for music error detection: (1) existing approaches rely on automatic alignment and are therefore prone to errors caused by small deviations between alignment targets; (2) there is a lack of sufficient data to train music error detection models, resulting in over-reliance on heuristics. To address (1), we propose a novel transformer model, Polytune, that takes audio inputs and outputs annotated music scores. This model can be trained end-to-end to implicitly align and compare performance audio with music scores through latent space representations. To address (2), we present a novel data generation technique capable of creating large-scale synthetic music error datasets. Our approach achieves a 64.1% average Error Detection F1 score, improving upon prior work by 40 percentage points across 14 instruments. Additionally, compared with existing transcription methods repurposed for music error detection, our model can handle multiple instruments. Our source code and datasets are available at https://github.com/ben2002chou/Polytune.
中文摘要:Polytune 变换器模型通过端到端的音频与乐谱隐式对齐技术及合成数据生成,解决了音乐错误检测中的对齐依赖和数据不足问题,在14种乐器上实现64.1%的F1分数并支持多乐器处理。
English Summary: The Polytune transformer model addresses limitations in music error detection by enabling end-to-end audio-to-score alignment and generating synthetic datasets, achieving a 64.1% F1 score with multi-instrument capability.

Authors:Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li
Title: Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
Abstract:
With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.
中文: 大型视觉语言模型比纯文本模型更易受安全风险威胁,但其内部的"安全头部"能在首个令牌生成时有效识别恶意提示,从而以最小计算成本实现高效防护。
English: Large vision-language models are more vulnerable to safety risks than text-only models, but their internal "safety heads" can detect malicious prompts during the first token generation, enabling effective protection with minimal computational cost.

Authors:Hwa Hui Tew, Fan Ding, Gaoxuan Li, Junn Yong Loo, Chee-Ming Ting, Ze Yang Ding, Chee Pin Tan
Title: ST-HCSS: Deep Spatio-Temporal Hypergraph Convolutional Neural Network for Soft Sensing
Abstract:
Higher-order sensor networks are more accurate in characterizing the nonlinear dynamics of sensory time-series data in modern industrial settings by allowing multi-node connections beyond simple pairwise graph edges. In light of this, we propose a deep spatio-temporal hypergraph convolutional neural network for soft sensing (ST-HCSS). In particular, our proposed framework is able to construct and leverage a higher-order graph (hypergraph) to model the complex multi-interactions between sensor nodes in the absence of prior structural knowledge. To capture rich spatio-temporal relationships underlying sensor data, our proposed ST-HCSS incorporates stacked gated temporal and hypergraph convolution layers to effectively aggregate and update hypergraph information across time and nodes. Our results validate the superiority of ST-HCSS compared to existing state-of-the-art soft sensors, and demonstrate that the learned hypergraph feature representations align well with the sensor data correlations. The code is available at https://github.com/htew0001/ST-HCSS.git
中文: 提出的深度时空超图卷积神经网络(ST-HCSS)通过超图结构有效建模传感器间的复杂交互并捕捉丰富的时空关系,验证了其优于现有软传感器的性能表现。
English: The proposed deep spatio-temporal hypergraph convolutional neural network (ST-HCSS) effectively models complex sensor interactions through hypergraph structures and captures rich spatio-temporal relationships, demonstrating superior performance over existing soft sensors.

Authors:Keng Hou Leong, Yuxuan Xiu, Wai Kin Chan
Title: Information Subtraction: Learning Representations for Conditional Entropy
Abstract:
The representations of conditional entropy and conditional mutual information are significant in explaining the unique effects among variables. While previous studies based on conditional contrastive sampling have effectively removed information regarding discrete sensitive variables, they have not yet extended their scope to continuous cases. This paper introduces Information Subtraction, a framework designed to generate representations that preserve desired information while eliminating the undesired. We implement a generative-based architecture that outputs these representations by simultaneously maximizing an information term and minimizing another. With its flexibility in disentangling information, we can iteratively apply Information Subtraction to represent arbitrary information components between continuous variables, thereby explaining the various relationships that exist between them. Our results highlight the representations' ability to provide semantic features of conditional entropy. By subtracting sensitive and domain-specific information, our framework demonstrates effective performance in fair learning and domain generalization. The code for this paper is available at https://github.com/jh-liang/Information-Subtraction
中文摘要:本文提出的信息减法框架通过最大化目标信息和最小化无关信息来生成表征,能够有效分离连续变量间的信息成分,在公平学习和领域泛化中表现出色。
English Summary: This paper introduces Information Subtraction, a flexible framework that generates representations by maximizing desired information and minimizing unwanted information, enabling effective performance in fair learning and domain generalization by disentangling information components between continuous variables.

Authors:Delin An, Chaoli Wang
Title: SurfPatch: Enabling Patch Matching for Exploratory Stream Surface Visualization
Abstract:
Unlike their line-based counterparts, surface-based techniques have yet to be thoroughly investigated in flow visualization due to their significant placement, speed, perception, and evaluation challenges. This paper presents SurfPatch, a novel framework supporting exploratory stream surface visualization. To begin with, we translate the issue of surface placement to surface selection and trace a large number of stream surfaces from a given flow field dataset. Then, we introduce a three-stage process (vertex-level classification, patch-level matching, and surface-level clustering) that hierarchically builds the connection between vertices and patches and between patches and surfaces. This bottom-up approach enables fine-grained, multiscale patch-level matching, contrasts sharply with the surface-level matching offered by existing works, and provides previously unavailable flexibility during querying. We design an intuitive visual interface for users to conveniently visualize and analyze the underlying collection of stream surfaces in an exploratory manner. SurfPatch is not limited to stream surfaces traced from steady flow datasets. We demonstrate its effectiveness through experiments on stream surfaces produced from steady and unsteady flows as well as isosurfaces extracted from scalar fields. The code is available at https://github.com/adlsn/SurfPatch.
中文摘要:本文提出SurfPatch这一创新框架,通过分层多尺度面片匹配解决了基于表面的流场可视化难题,并借助直观的可视界面实现了灵活的探索式分析。
English Summary: This paper introduces SurfPatch, a novel framework that addresses the challenges in surface-based flow visualization by enabling hierarchical, multiscale patch-level matching and providing flexible querying through an intuitive visual interface.

Authors:Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, Yu Wang
Title: FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models
Abstract:
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as cumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding visual tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding visual tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces visual tokens by 70%, achieving 1.6-3.6x end-to-end speedups, with an average performance impact of less than 3%. Our code is available at: https://github.com/thu-nics/FrameFusion.
中文:FrameFusion通过结合基于相似性的令牌合并和基于重要性的剪枝,将大型视觉语言模型中的视觉令牌减少70%,在实现显著加速的同时保持了性能损失最小。
English: FrameFusion combines similarity-based token merging with importance-based pruning to reduce visual tokens by 70% in Large Vision-Language Models, achieving significant speed improvements with minimal performance loss.
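The similarity-based merging step described in the FrameFusion abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the averaging merge rule, and the 0.9 threshold are assumptions chosen for illustration. Only the idea of comparing spatially corresponding tokens in adjacent frames and merging the similar ones comes from the abstract; the importance-based pruning stage is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def merge_adjacent_frames(frames, threshold=0.9):
    """Merge each token into the group of the same-position token in the previous
    frame when their cosine similarity reaches `threshold` (hypothetical value).

    frames: list of frames; each frame is a list of token vectors in a fixed spatial order.
    Returns a list of (first_frame, position, averaged_vector), one per kept group.
    """
    groups = []   # each group accumulates the sum and count of its merged tokens
    owner = {}    # position -> index of the group holding the previous frame's token
    for t, frame in enumerate(frames):
        new_owner = {}
        for p, tok in enumerate(frame):
            has_prev = t > 0 and p < len(frames[t - 1])
            if has_prev and cosine(tok, frames[t - 1][p]) >= threshold:
                g = groups[owner[p]]
                g["sum"] = [s + x for s, x in zip(g["sum"], tok)]
                g["count"] += 1
                new_owner[p] = owner[p]      # chains of similar tokens share one group
            else:
                groups.append({"frame": t, "pos": p, "sum": list(tok), "count": 1})
                new_owner[p] = len(groups) - 1
        owner = new_owner
    return [(g["frame"], g["pos"], [s / g["count"] for s in g["sum"]]) for g in groups]

frames = [
    [[1, 0], [0, 1]],
    [[1, 0], [1, 0]],
    [[1, 0.01], [1, 0]],
]
merged = merge_adjacent_frames(frames)
print(len(merged), "tokens kept out of", sum(len(f) for f in frames))
```

In this toy input, a token that stays nearly constant across frames collapses into a single averaged token, while a token that changes between frames starts a new group.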

Authors:Atharva Divekar, Atharva Sonawane
Title: Leveraging AI for Automatic Classification of PCOS Using Ultrasound Imaging
Abstract:
The AUTO-PCOS Classification Challenge seeks to advance the diagnostic capabilities of artificial intelligence (AI) in identifying Polycystic Ovary Syndrome (PCOS) through automated classification of healthy and unhealthy ultrasound frames. This report outlines our methodology for building a robust AI pipeline utilizing transfer learning with the InceptionV3 architecture to achieve high accuracy in binary classification. Preprocessing steps ensured the dataset was optimized for training, validation, and testing, while interpretability methods like LIME and saliency maps provided valuable insights into the model's decision-making. Our approach achieved an accuracy of 90.52%, with precision, recall, and F1-score metrics exceeding 90% on validation data, demonstrating its efficacy. The project underscores the transformative potential of AI in healthcare, particularly in addressing diagnostic challenges like PCOS. Key findings, challenges, and recommendations for future enhancements are discussed, highlighting the pathway for creating reliable, interpretable, and scalable AI-driven medical diagnostic tools.
Chinese: AUTO-PCOS分类挑战通过InceptionV3迁移学习构建的AI模型在多囊卵巢综合征超声图像分类中准确率超过90%,展现了人工智能在医疗诊断领域的变革潜力。
English: This report describes an AI pipeline for the AUTO-PCOS Classification Challenge that uses InceptionV3 transfer learning and achieves over 90% accuracy in classifying healthy and unhealthy ultrasound frames, demonstrating AI's potential to transform medical diagnostics.

Authors:Jiahao Qin, Feng Liu
Title: GAF-FusionNet: Multimodal ECG Analysis via Gramian Angular Fields and Split Attention
Abstract:
Electrocardiogram (ECG) analysis plays a crucial role in diagnosing cardiovascular diseases, but accurate interpretation of these complex signals remains challenging. This paper introduces a novel multimodal framework (GAF-FusionNet) for ECG classification that integrates time-series analysis with image-based representation using Gramian Angular Fields (GAF). Our approach employs a dual-layer cross-channel split attention module to adaptively fuse temporal and spatial features, enabling nuanced integration of complementary information. We evaluate GAF-FusionNet on three diverse ECG datasets: ECG200, ECG5000, and the MIT-BIH Arrhythmia Database. Results demonstrate significant improvements over state-of-the-art methods, with our model achieving 94.5%, 96.9%, and 99.6% accuracy on the respective datasets. Our code will soon be available at https://github.com/Cross-Innovation-Lab/GAF-FusionNet.git.
中文摘要:本文提出GAF-FusionNet多模态心电图分类框架,通过格拉米角场融合时序分析与图像表征,在三个基准数据集上实现了最先进的分类准确率。
English Summary: This paper presents GAF-FusionNet, a multimodal ECG classification framework that combines time-series analysis with image-based representations using Gramian Angular Fields, achieving state-of-the-art accuracy across three benchmark datasets.
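For readers unfamiliar with Gramian Angular Fields, the sketch below shows how a 1-D series can be turned into an image-like matrix. It implements the standard summation variant (rescale to [-1, 1], take the arccosine, then form cos(phi_i + phi_j)); the abstract does not say which GAF variant or preprocessing GAF-FusionNet uses, so treat the function name and these choices as assumptions.

```python
import math

def gramian_angular_summation_field(series):
    """Return the GASF matrix of a 1-D series as a list of lists.

    Steps: min-max rescale to [-1, 1], map each value to an angle with arccos,
    then set entry (i, j) to cos(phi_i + phi_j).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must contain at least two distinct values")
    scaled = [((x - hi) + (x - lo)) / (hi - lo) for x in series]
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]

gasf = gramian_angular_summation_field([0.0, 1.0, 2.0])
for row in gasf:
    print([round(v, 3) for v in row])
```

The resulting matrix is symmetric and has one row and column per time step, which is what allows an image model to consume a time series.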

Authors:Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Title: VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Abstract:
Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.
中文: 本文提出一种多阶段训练方法,使多模态大语言模型具备强大的视觉与语音交互能力,无需独立语音模块即可实现高效端到端对话,同时在视觉与语音任务中保持优异性能。
English: This paper introduces a multi-stage training method that equips multimodal large language models with strong visual and speech interaction capabilities, enabling efficient end-to-end dialogue without separate speech modules while maintaining high performance across vision and speech benchmarks.

Authors:Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen
Title: Metadata Conditioning Accelerates Language Model Pre-training
Abstract:
The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like www.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
中文:MeCo方法通过在预训练中先结合元数据与文本再进行纯文本冷却阶段,显著提升了训练效率并实现了无需额外计算开销的模型引导能力。
English: The MeCo method enhances pre-training by initially using metadata alongside text and then transitioning to text-only training, significantly improving efficiency and enabling model steering without added computational cost.
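The two-phase data formatting that the MeCo abstract describes can be sketched as follows. The `URL: ...` prefix template, the 10% cooldown fraction, and the function names are illustrative assumptions, not details taken from the abstract; the abstract only states that metadata accompanies the text first and that a final cooldown phase uses the standard text alone.

```python
def format_example(text, url, step, total_steps, cooldown_fraction=0.1):
    """Return the training string for one document at a given training step.

    Before the cooldown phase the (hypothetical) template prepends the source URL;
    during cooldown, or when no URL is known, the plain text is returned.
    """
    cooldown_steps = round(total_steps * cooldown_fraction)
    cooldown_start = total_steps - cooldown_steps
    if step >= cooldown_start or not url:
        return text
    return f"URL: {url}\n\n{text}"

def steer_prompt(prompt, url):
    """At inference time, condition the prompt on real or fabricated metadata."""
    return f"URL: {url}\n\n{prompt}"

early = format_example("The heart has four chambers.", "en.wikipedia.org", 0, 1000)
late = format_example("The heart has four chambers.", "en.wikipedia.org", 900, 1000)
print(repr(early))
print(repr(late))
print(repr(steer_prompt("Q: What is the capital of France?", "factquizmaster.com")))
```

Because the final phase sees only plain text, a model trained this way can be prompted without metadata, while the same prefix format remains available for steering.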

Authors:Weizhi Zhang, Yuanchen Bei, Liangwei Yang, Henry Peng Zou, Peilin Zhou, Aiwei Liu, Yinghui Li, Hao Chen, Jianling Wang, Yu Wang, Feiran Huang, Sheng Zhou, Jiajun Bu, Allen Lin, James Caverlee, Fakhri Karray, Irwin King, Philip S. Yu
Title: Cold-Start Recommendation towards the Era of Large Language Models (LLMs): A Comprehensive Survey and Roadmap
Abstract:
The cold-start problem is one of the long-standing challenges in recommender systems, focusing on accurately modeling new or interaction-limited users or items to provide better recommendations. Due to the diversification of internet platforms and the exponential growth of users and items, the importance of cold-start recommendation (CSR) is becoming increasingly evident. At the same time, large language models (LLMs) have achieved tremendous success and possess strong capabilities in modeling user and item information, providing new potential for cold-start recommendations. However, the CSR research community still lacks a comprehensive review and reflection on this field. To fill this gap, in this paper we take the perspective of the LLM era and provide a comprehensive review and discussion of the roadmap, related literature, and future directions of CSR. Specifically, we explore the development path of how existing CSR methods utilize information, from content features, graph relations, and domain information to the world knowledge possessed by large language models, aiming to provide new insights for both the research and industrial communities on CSR. Related resources on cold-start recommendation are collected and continuously updated for the community at https://github.com/YuanchenBei/Awesome-Cold-Start-Recommendation.
中文摘要:本文在大语言模型时代背景下全面回顾冷启动推荐研究的发展路径,从传统内容特征到LLM的世界知识利用,并探讨了未来研究方向。
English Summary: This paper provides a comprehensive review of cold-start recommendation research, analyzing its evolution from traditional content features to large language models' world knowledge while discussing future directions.

Authors:Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li
Title: Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding
Abstract:
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding for downstream multi-modal tasks. Despite their success, LVLMs still suffer from generating hallucinations in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding and attention rectification, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from spurious inter-modality correlations. In this paper, we propose an Inter-Modality Correlation Calibration Decoding (IMCCD) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination through a novel contrastive decoding mechanism. During the estimation of the distorted distribution, CMVED masks the value vectors associated with significant cross-modal attention weights, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Content-Driven Attention Refinement (CDAR) module refines cross-modal attention weights, guiding LVLMs to focus on important visual content. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLM text generation. Our code will be available at https://github.com/lijm48/IMCCD.
Chinese: 本文提出了一种跨模态相关性校准解码方法(IMCCD),通过跨模态对比解码和注意力优化来减少大型视觉语言模型中的幻觉问题,并在多个基准测试中验证了其优越性。
English: The paper introduces an Inter-Modality Correlation Calibration Decoding (IMCCD) method to reduce hallucinations in large vision-language models by employing cross-modal contrastive decoding and attention refinement, validated as superior on benchmarks.

Authors:Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Abstract:
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
中文摘要:本文通过用少量文本长思考数据微调多模态大语言模型,开发了多模态慢思考系统Virgo,证明基于语言的推理能有效迁移慢思考能力,且文本推理数据比视觉数据更能激发模型的深度思考潜能。
English Summary: This paper introduces Virgo, a multimodal slow-thinking system created by fine-tuning a multimodal large language model with textual long-form thought data, demonstrating that language-based reasoning can effectively transfer slow-thinking capabilities across modalities.

Authors:Huaxiang Zhang, Kai Liu, Zhongxue Gan, Guo-Niu Zhu
Title: UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery
Abstract:
Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused down-sampling module is presented to retain critical spatial details during down-sampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and $\text{AP}_{50}$ by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page: https://github.com/ValiantDiligent/UAV-DETR
Chinese: 本文提出专为无人机图像设计的UAV-DETR检测框架,通过多尺度频率增强特征融合和语义对齐模块,在多个无人机数据集上显著提升了检测性能,相比基线方法AP指标提升超过3%。
English: This paper introduces UAV-DETR, an efficient detection transformer framework specifically designed for UAV imagery, incorporating multi-scale feature fusion with frequency enhancement and semantic alignment modules to significantly improve object detection performance on drone datasets.

Authors:Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
Title: SDPO: Segment-Level Direct Preference Optimization for Social Agents
Abstract:
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
Chinese Summary: 提出的分段级直接偏好优化(SDPO)方法通过动态选择关键交互片段来优化多轮智能体行为,在SOTOPIA基准测试中优于现有方法和GPT-4o,同时最大程度减少了训练噪声。
English Summary: The proposed Segment-Level Direct Preference Optimization (SDPO) method dynamically selects key interaction segments to optimize multi-turn agent behavior, outperforming existing methods and GPT-4o on the SOTOPIA benchmark while minimizing training noise.
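To make the segment-level idea concrete, the sketch below applies the standard DPO loss to log-probabilities summed over a chosen span of turns rather than a single turn or a whole session. It is an illustration under stated assumptions, not the SDPO algorithm itself: the abstract does not describe how segments are selected, so the segment boundaries are simply passed in, and the helper names and `beta=0.1` are hypothetical.

```python
import math

def segment_logprob(turn_logprobs, start, end):
    """Sum per-turn log-probabilities over the half-open turn range [start, end)."""
    return sum(turn_logprobs[start:end])

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Per-turn log-probabilities of the agent's utterances (made-up numbers).
policy_good, ref_good = [-1.0, -2.0, -1.5, -3.0], [-1.0, -2.5, -2.0, -3.0]
policy_bad, ref_bad = [-1.0, -4.0, -3.5, -3.0], [-1.0, -3.0, -3.0, -3.0]

# Only turns 1..2 form the (externally selected) segment that enters the loss.
loss = dpo_loss(
    segment_logprob(policy_good, 1, 3), segment_logprob(policy_bad, 1, 3),
    segment_logprob(ref_good, 1, 3), segment_logprob(ref_bad, 1, 3),
)
print(loss)
```

Restricting the sums to the selected turns means that turns outside the segment contribute nothing to the gradient, which is one way to read the abstract's claim about reducing training noise.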

Authors:Hu Ding, Yan Yan, Yang Lu, Jing-Hao Xue, Hanzi Wang
Title: Uncertainty-Aware Label Refinement on Hypergraphs for Personalized Federated Facial Expression Recognition
Abstract:
Most facial expression recognition (FER) models are trained on large-scale expression data with centralized learning. Unfortunately, collecting a large amount of centralized expression data is difficult in practice due to privacy concerns of facial images. In this paper, we investigate FER under the framework of personalized federated learning, which is a valuable and practical decentralized setting for real-world applications. To this end, we develop a novel uncertainty-Aware label refineMent on hYpergraphs (AMY) method. For local training, each local model consists of a backbone, an uncertainty estimation (UE) block, and an expression classification (EC) block. In the UE block, we leverage a hypergraph to model complex high-order relationships between expression samples and incorporate these relationships into uncertainty features. A personalized uncertainty estimator is then introduced to estimate reliable uncertainty weights of samples in the local client. In the EC block, we perform label propagation on the hypergraph, obtaining high-quality refined labels for retraining an expression classifier. Based on the above, we effectively alleviate heterogeneous sample uncertainty across clients and learn a robust personalized FER model in each client. Experimental results on two challenging real-world facial expression databases show that our proposed method consistently outperforms several state-of-the-art methods. This indicates the superiority of hypergraph modeling for uncertainty estimation and label refinement on the personalized federated FER task. The source code will be released at https://github.com/mobei1006/AMY.
中文: 本文提出AMY方法,通过超图建模实现标签优化和样本不确定性评估,在保护数据隐私的个性化联邦学习框架下显著提升了面部表情识别性能。
English: This paper introduces AMY, a personalized federated learning method that uses hypergraph modeling to refine labels and estimate sample uncertainty, achieving superior facial expression recognition performance while addressing data privacy concerns.

Authors:Nouran Khallaf, Carlo Eugeni, Serge Sharoff
Title: Reading Between the Lines: A dataset and a study on why some texts are tougher than others
Abstract:
Our research aims at better understanding what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically, people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties which is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from the parallel texts (standard English and Easy to Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform the task of multiclass classification to predict the strategies required for simplification. We also investigate the possibility of interpreting the decisions of this language model when it is aimed at predicting the difficulty of sentences. The resources are available from https://github.com/Nouran-Khallaf/why-tough
中文摘要:本研究针对智力障碍读者的文本难度因素,结合心理学与翻译研究开发了标注方案,并利用Transformer模型预测简化策略,同时探索模型决策机制的可解释性。
English Summary: This study investigates text difficulty factors for intellectually disabled readers by developing an annotation scheme based on psychological and translation research, and employs transformer models to predict simplification strategies while interpreting their decision-making processes.

Authors:Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
Title: Ingredients: Blending Custom Photos with Video Diffusion Transformers
Abstract:
This paper presents a powerful framework to customize video creations by incorporating multiple specific identity (ID) photos, with video diffusion Transformers, referred to as Ingredients. Generally, our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image query in video diffusion transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architecture, compared to existing methods. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients.
中文摘要:本文提出了名为Ingredients的框架,通过面部提取、多尺度投影和身份路由三个核心模块,利用视频扩散Transformer将多张定制照片转化为动态个性化视频内容,在生成视频控制方面实现了显著进步。
English Summary: This paper introduces a framework called Ingredients that uses video diffusion Transformers to create customized videos from multiple identity photos through three specialized modules for facial feature extraction, embedding projection, and dynamic ID allocation.

Authors:Ruikang Chen, Yan Yan, Jing-Hao Xue, Yang Lu, Hanzi Wang
Title: Augmentation Matters: A Mix-Paste Method for X-Ray Prohibited Item Detection under Noisy Annotations
Abstract:
Automatic X-ray prohibited item detection is vital for public safety. Existing deep learning-based methods all assume that the annotations of training X-ray images are correct. However, obtaining correct annotations is extremely hard if not impossible for large-scale X-ray images, where item overlapping is ubiquitous. As a result, X-ray images are easily contaminated with noisy annotations, leading to performance deterioration of existing methods. In this paper, we address the challenging problem of training a robust prohibited item detector under noisy annotations (including both category noise and bounding box noise) from a novel perspective of data augmentation, and propose an effective label-aware mixed patch paste augmentation method (Mix-Paste). Specifically, for each item patch, we mix several item patches with the same category label from different images and replace the original patch in the image with the mixed patch. In this way, the probability of containing the correct prohibited item within the generated image is increased. Meanwhile, the mixing process mimics item overlapping, enabling the model to learn the characteristics of X-ray images. Moreover, we design an item-based large-loss suppression (LLS) strategy to suppress the large losses corresponding to potentially positive predictions of additional items due to the mixing operation. We show the superiority of our method on X-ray datasets under noisy annotations. In addition, we evaluate our method on the noisy MS-COCO dataset to showcase its generalization ability. These results clearly indicate the great potential of data augmentation to handle noisy annotations. The source code is released at https://github.com/wscds/Mix-Paste.
中文摘要:本文提出Mix-Paste方法,通过混合同类物品图像块的数据增强技术,有效解决X射线图像在噪声标注下的违禁品检测问题,并采用损失抑制策略提升模型在标注噪声环境中的鲁棒性。
English Summary: This paper introduces Mix-Paste, a label-aware data augmentation method that enhances robust prohibited item detection in X-ray images by mixing item patches to handle noisy annotations and mimic item overlapping, complemented by a loss suppression strategy to improve performance under noisy training conditions.
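The patch-mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: uniform averaging as the mixing rule, the inclusion of the original patch in the mix, the assumption that all patches are already resized to the box size, and the names `mix_patches`, `paste_patch`, and `mix_paste` are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of pixel intensities


def mix_patches(patches: List[Image]) -> Image:
    """Average equally sized patches pixel by pixel (uniform mixing is an assumption)."""
    height, width = len(patches[0]), len(patches[0][0])
    for patch in patches:
        if len(patch) != height or any(len(row) != width for row in patch):
            raise ValueError("all patches must share the same size")
    n = len(patches)
    return [[sum(p[r][c] for p in patches) / n for c in range(width)]
            for r in range(height)]


def paste_patch(image: Image, patch: Image, top: int, left: int) -> Image:
    """Return a copy of `image` with `patch` written at (top, left); assumes it fits."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        out[top + r][left:left + len(patch_row)] = patch_row
    return out


def mix_paste(image: Image,
              box: Tuple[int, int, int, int],
              label: str,
              patch_bank: List[Tuple[str, Image]]) -> Image:
    """Replace the patch inside `box` with a mix of itself and same-label patches.

    `box` is (top, left, height, width); `patch_bank` holds (label, patch) pairs
    taken from other images (hypothetical data layout).
    """
    top, left, height, width = box
    original = [row[left:left + width] for row in image[top:top + height]]
    same_label = [patch for lbl, patch in patch_bank if lbl == label]
    return paste_patch(image, mix_patches([original] + same_label), top, left)


image = [[0.0] * 3 for _ in range(3)]
bank = [("knife", [[2.0, 2.0], [2.0, 2.0]]), ("gun", [[9.0, 9.0], [9.0, 9.0]])]
augmented = mix_paste(image, (0, 0, 2, 2), "knife", bank)
```

Only patches whose label matches the annotated category enter the mix, which reflects the label-aware aspect named in the abstract; the large-loss suppression strategy is not sketched here.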

Authors:Jiajun Cao, Yuan Zhang, Tao Huang, Ming Lu, Qizhe Zhang, Ruichuan An, Ningning MA, Shanghang Zhang
Title: MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders
Abstract:
Visual encoders are fundamental components in vision-language models (VLMs), each showcasing unique strengths derived from various pre-trained visual foundation models. To leverage the various capabilities of these encoders, recent studies incorporate multiple encoders within a single VLM, leading to a considerable increase in computational cost. In this paper, we present Mixture-of-Visual-Encoder Knowledge Distillation (MoVE-KD), a novel framework that distills the unique proficiencies of multiple vision encoders into a single, efficient encoder model. Specifically, to mitigate conflicts and retain the unique characteristics of each teacher encoder, we employ low-rank adaptation (LoRA) and mixture-of-experts (MoEs) to selectively activate specialized knowledge based on input features, enhancing both adaptability and efficiency. To regularize the KD process and enhance performance, we propose an attention-based distillation strategy that adaptively weighs the different encoders and emphasizes valuable visual tokens, reducing the burden of replicating comprehensive but distinct features from multiple teachers. Comprehensive experiments on popular VLMs, such as LLaVA and LLaVA-NeXT, validate the effectiveness of our method. Our code is available at: https://github.com/hey-cjj/MoVE-KD.
中文摘要:本文提出MoVE-KD框架,通过低秩适应和专家混合机制选择性地激活不同视觉编码器的专长知识,并采用基于注意力的蒸馏策略自适应权衡编码器价值,将多种视觉编码器的优势高效蒸馏到单一模型中。
English Summary: The paper introduces MoVE-KD, a framework that distills the diverse capabilities of multiple visual encoders into a single efficient model using LoRA and mixture-of-experts to selectively activate specialized knowledge while employing attention-based distillation to adaptively weigh encoders and emphasize valuable visual tokens.

Authors:Fengrui Zhang, Yujia Yin, Hongzong Li, Yifan Chen, Tianyi Qu
Title: Catch Causal Signals from Edges for Label Imbalance in Graph Classification
Abstract:
Despite significant advancements in causal research on graphs and its application to cracking label imbalance, the role of edge features in detecting the causal effects within graphs has been largely overlooked, leaving existing methods with untapped potential for further performance gains. In this paper, we enhance the causal attention mechanism through effectively leveraging edge information to disentangle the causal subgraph from the original graph, as well as further utilizing edge features to reshape graph representations. Capturing more comprehensive causal signals, our design leads to improved performance on graph classification tasks with label imbalance issues. We evaluate our approach on real-world datasets PTC, Tox21, and ogbg-molhiv, observing improvements over baselines. Overall, we highlight the importance of edge features in graph causal detection and provide a promising direction for addressing label imbalance challenges in graph-level tasks. The model implementation details and the codes are available on https://github.com/fengrui-z/ECAL
中文摘要:本研究通过将边特征融入因果注意力机制,增强了图因果检测能力,有效解决了标签不平衡问题,并在真实数据集上提升了分类性能。
English Summary: This study enhances graph causal detection by integrating edge features into the causal attention mechanism, effectively addressing label imbalance and improving classification performance on real-world datasets.

Authors:Jina Kim, Jihoo Lee, Je-Won Kang
Title: SNeRV: Spectra-preserving Neural Representation for Video
Abstract:
Neural representation for video (NeRV), which employs a neural network to parameterize video signals, introduces a novel methodology in video representations. However, existing NeRV-based methods have difficulty in capturing fine spatial details and motion patterns due to spectral bias, in which a neural network learns high-frequency (HF) components at a slower rate than low-frequency (LF) components. In this paper, we propose spectra-preserving NeRV (SNeRV) as a novel approach to enhance implicit video representations by efficiently handling various frequency components. SNeRV uses 2D discrete wavelet transform (DWT) to decompose video into LF and HF features, preserving spatial structures and directly addressing the spectral bias issue. To balance the compactness, we encode only the LF components, while HF components that include fine textures are generated by a decoder. Specialized modules, including a multi-resolution fusion unit (MFU) and a high-frequency restorer (HFR), are integrated into a backbone to facilitate the representation. Furthermore, we extend SNeRV to effectively capture temporal correlations between adjacent video frames, by casting the extension as additional frequency decomposition to a temporal domain. This approach allows us to embed spatio-temporal LF features into the network, using temporally extended up-sampling blocks (TUBs). Experimental results demonstrate that SNeRV outperforms existing NeRV models in capturing fine details and achieves enhanced reconstruction, making it a promising approach in the field of implicit video representations. The codes are available at https://github.com/qwertja/SNeRV.
中文摘要:SNeRV提出了一种新颖的隐式视频表示方法,通过二维小波变换和专用模块解决频谱偏差问题,能更好地捕捉细节和运动模式,在重建质量上超越了现有NeRV模型。
English Summary: SNeRV introduces a novel implicit video representation method that uses 2D wavelet transform and specialized modules to overcome spectral bias, enabling better capture of fine details and motion patterns while outperforming existing NeRV models in reconstruction quality.
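To make the LF/HF decomposition concrete, here is a minimal sketch of a one-level 2D DWT on a single frame. The abstract does not name the wavelet, so the Haar wavelet is an assumption, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical; this is not the SNeRV code.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def haar_dwt2(frame: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """One-level 2D Haar DWT of a frame with even height and width.

    Returns (approx, col_diff, row_diff, diag): one low-frequency band and
    three high-frequency detail bands, each half the size of the input.
    """
    rows, cols = len(frame), len(frame[0])
    if rows % 2 or cols % 2:
        raise ValueError("frame height and width must be even")
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            tl, tr = frame[r][c], frame[r][c + 1]          # top-left, top-right
            bl, br = frame[r + 1][c], frame[r + 1][c + 1]  # bottom-left, bottom-right
            a_row.append((tl + tr + bl + br) / 2)  # low-frequency (LF) average
            c_row.append((tl - tr + bl - br) / 2)  # left column minus right column
            r_row.append((tl + tr - bl - br) / 2)  # top row minus bottom row
            d_row.append((tl - tr - bl + br) / 2)  # diagonal detail
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, col_diff, row_diff, diag


def haar_idwt2(approx: Matrix, col_diff: Matrix,
               row_diff: Matrix, diag: Matrix) -> Matrix:
    """Invert haar_dwt2, rebuilding the frame from its four sub-bands."""
    frame = []
    for r in range(len(approx)):
        top, bottom = [], []
        for c in range(len(approx[0])):
            ll, cd, rd, dg = approx[r][c], col_diff[r][c], row_diff[r][c], diag[r][c]
            top += [(ll + cd + rd + dg) / 2, (ll - cd + rd - dg) / 2]
            bottom += [(ll + cd - rd - dg) / 2, (ll - cd - rd + dg) / 2]
        frame += [top, bottom]
    return frame


frame = [[1.0, 2.0, 0.0, 0.0],
         [3.0, 4.0, 0.0, 0.0]]
bands = haar_dwt2(frame)
restored = haar_idwt2(*bands)
```

The transform is invertible, so the LF band together with the three HF bands carries the full frame. Storing only the LF band and leaving the HF bands to a decoder, as the abstract describes, trades exact recovery for compactness.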

Authors:Tengfei Wang, Xin Wang, Yongmao Hou, Yiwei Xu, Wendi Zhang, Zongqian Zhan
Title: PG-SAG: Parallel Gaussian Splatting for Fine-Grained Large-Scale Urban Buildings Reconstruction via Semantic-Aware Grouping
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a transformative method in the field of real-time novel view synthesis. Based on 3DGS, recent advancements cope with large-scale scenes via spatial-based partition strategy to reduce video memory and optimization time costs. In this work, we introduce a parallel Gaussian splatting method, termed PG-SAG, which fully exploits semantic cues for both partitioning and Gaussian kernel optimization, enabling fine-grained building surface reconstruction of large-scale urban areas without downsampling the original image resolution. First, the Cross-modal model - Language Segment Anything is leveraged to segment building masks. Then, the segmented building regions are grouped into sub-regions according to the visibility check across registered images. The Gaussian kernels for these sub-regions are optimized in parallel with masked pixels. In addition, the normal loss is re-formulated for the detected edges of masks to alleviate the ambiguities in normal vectors on edges. Finally, to improve the optimization of 3D Gaussians, we introduce a gradient-constrained balance-load loss that accounts for the complexity of the corresponding scenes, effectively minimizing the thread waiting time in the pixel-parallel rendering stage as well as the reconstruction loss. Extensive experiments were conducted on various urban datasets, and the results demonstrate the superior performance of our PG-SAG on building surface reconstruction compared to several state-of-the-art 3DGS-based methods. Project Web: https://github.com/TFWang-9527/PG-SAG.
中文:PG-SAG是一种利用语义线索进行分区和优化的并行高斯溅射方法,能够高效精确地实现大规模城市场景的高分辨率重建。
English: PG-SAG is a parallel Gaussian splatting method that leverages semantic cues for partitioning and optimization, enabling high-resolution reconstruction of large-scale urban scenes with improved efficiency and accuracy.

Authors:Bohan Zhang, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
Title: CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis
Abstract:
Current inference scaling methods, such as Self-consistency and Best-of-N, have proven effective in improving the accuracy of LLMs on complex reasoning tasks. However, these methods rely heavily on the quality of candidate responses and are unable to produce correct answers when all candidates are incorrect. In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all candidate responses are flawed. To enable a lightweight and cost-effective implementation, we introduce an automated data generation pipeline that creates diverse training data. This allows smaller LLMs trained on this data to improve the inference accuracy of larger models, including API-based LLMs. Experimental results across four benchmark datasets with seven policy models demonstrate that our method significantly enhances performance, with gains of 11.8% for Llama3-8B and 10.3% for GPT-4o on the MATH dataset. The corresponding training data and code are publicly available on https://github.com/RUCKBReasoning/CoT-based-Synthesizer.
Chinese: 本文提出了一种基于思维链的合成器新策略,通过分析多个候选回答的互补信息来合成更优答案,即使在所有候选答案均有缺陷时也能提升大语言模型的推理准确率,实验证明该方法在多个基准数据集上显著提升了模型性能。
English: This paper introduces a CoT-based Synthesizer, a novel inference scaling strategy that enhances LLM accuracy by synthesizing superior answers from flawed candidate responses using complementary reasoning, with experiments showing significant performance gains on benchmark datasets.
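The limitation the abstract attributes to Self-consistency can be seen in a minimal sketch of majority voting over final answers. This illustrates the baseline only, not the proposed synthesizer; the function name `self_consistency` and the sample answers are hypothetical.

```python
from collections import Counter
from typing import List


def self_consistency(answers: List[str]) -> str:
    """Return the most frequent final answer among sampled candidate responses."""
    if not answers:
        raise ValueError("at least one candidate answer is required")
    return Counter(answers).most_common(1)[0][0]


# The vote recovers the right answer when it appears often enough.
mostly_right = self_consistency(["42", "41", "42", "42", "40"])

# When no candidate is correct, a vote can only return one of the wrong answers.
all_wrong = self_consistency(["41", "40", "41", "39"])
```

Because voting selects among existing candidates, its output is always one of them. That is the gap the abstract says answer synthesis is meant to close.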

Authors:Yin Cai, Zhouhong Gu, Zhaohan Du, Zheyu Ye, Shaosheng Cao, Yiqian Xu, Hongwei Feng, Ping Chen
Title: MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in environmental perception, reasoning-based decision-making, and simulating complex human behaviors, particularly in interactive role-playing contexts. This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive framework designed to assess LLMs' proficiency in portraying advanced human behaviors through murder mystery games. MIRAGE features eight intricately crafted scripts encompassing diverse themes and styles, providing a rich simulation. To evaluate LLMs' performance, MIRAGE employs four distinct methods: the Trust Inclination Index (TII) to measure dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to measure LLMs' capability of conducting information, the Interactivity Capability Index (ICI) to assess role-playing capabilities and the Script Compliance Index (SCI) to assess LLMs' capability of understanding and following instructions. Our experiments indicate that even popular models like GPT-4 face significant challenges in navigating the complexities presented by the MIRAGE. The datasets and simulation codes are available at https://github.com/lime728/MIRAGE.
中文摘要:本文提出MIRAGE框架,通过谋杀之谜游戏评估大语言模型模拟复杂人类行为的能力,发现即使是GPT-4等先进模型在复杂角色扮演场景中仍面临显著挑战。
English Summary: This paper introduces the MIRAGE framework to evaluate large language models' ability to simulate complex human behaviors through murder mystery games, finding that even advanced models like GPT-4 struggle with the sophisticated role-playing scenarios.

Authors:Kang Yi, Haoran Tang, Yumeng Li, Jing Xu, Jun Zhang
Title: Dual Mutual Learning Network with Global-local Awareness for RGB-D Salient Object Detection
Abstract:
RGB-D salient object detection (SOD), aiming to highlight prominent regions of a given scene by jointly modeling RGB and depth information, is one of the challenging pixel-level prediction tasks. Recently, the dual-attention mechanism has been applied to this area due to its ability to strengthen the detection process. However, most existing methods directly fuse attentional cross-modality features under a manual-mandatory fusion paradigm without considering the inherent discrepancy between the RGB and depth modalities, which may lead to a reduction in performance. Moreover, the long-range dependencies derived from global and local information make it difficult to leverage a unified efficient fusion strategy. Hence, in this paper, we propose the GL-DMNet, a novel dual mutual learning network with global-local awareness. Specifically, we present a position mutual fusion module and a channel mutual fusion module to exploit the interdependencies among different modalities in spatial and channel dimensions. Besides, we adopt an efficient decoder based on cascade transformer-infused reconstruction to integrate multi-level fusion features jointly. Extensive experiments on six benchmark datasets demonstrate that our proposed GL-DMNet performs better than 24 RGB-D SOD methods, achieving an average improvement of ~3% across four evaluation metrics compared to the second-best model (S3Net). Codes and results are available at https://github.com/kingkung2016/GL-DMNet.
中文: 本文提出GL-DMNet,一种具有全局-局部感知的双重互学习网络,通过位置和通道互融合模块解决RGB-D显著目标检测中的跨模态特征融合难题,在六个基准数据集上超越24种现有方法,四项指标平均提升约3%。
English: This paper introduces GL-DMNet, a novel dual mutual learning network with global-local awareness that addresses challenges in RGB-D salient object detection by employing position and channel mutual fusion modules to effectively integrate cross-modality features, outperforming 24 existing methods with an average improvement of approximately 3% across four metrics over the second-best model.

Authors:Tien Dang, Viet Thanh Duy Nguyen, Minh Tuan Le, Truong-Son Hy
Title: Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs
Abstract:
Biomedical Knowledge Graphs (BKGs) integrate diverse datasets to elucidate complex relationships within the biomedical field. Effective link prediction on these graphs can uncover valuable connections, such as potential novel drug-disease relations. We introduce a novel multimodal approach that unifies embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to enhance intra-entity relationships while employing a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for effective link prediction. To address limitations in existing BKGs, we present PrimeKG++, an enriched knowledge graph incorporating multimodal data, including biological sequences and textual descriptions for each entity type. By combining semantic and relational information in a unified representation, our approach demonstrates strong generalizability, enabling accurate link predictions even for unseen nodes. Experimental results on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate the effectiveness and robustness of our method across diverse biomedical datasets. Our source code, pre-trained models, and data are publicly available at https://github.com/HySonLab/BioMedKG
中文: 我们提出的多模态方法将语言模型嵌入与图对比学习和知识图谱嵌入相结合,显著提升了生物医学链接预测效果,在PrimeKG++和DrugBank等丰富知识图谱数据集上表现出卓越性能。
English: Our novel multimodal approach integrates language model embeddings with graph contrastive learning and knowledge graph embeddings to enhance biomedical link prediction, demonstrating strong performance on enriched knowledge graphs like PrimeKG++ and DrugBank datasets.

Authors:Zihao Wang, Yuxiang Wei, Fan Li, Renjing Pei, Hang Xu, Wangmeng Zuo
Title: ACE: Anti-Editing Concept Erasure in Text-to-Image Models
Abstract:
Recent advances in text-to-image diffusion models have significantly facilitated the generation of high-quality images, but have also raised concerns about the illegal creation of harmful content, such as copyrighted images. Existing concept erasure methods achieve superior results in preventing the production of erased concepts from prompts, but typically perform poorly in preventing undesired editing. To address this issue, we propose an Anti-Editing Concept Erasure (ACE) method, which not only erases the target concept during generation but also filters it out during editing. Specifically, we propose to inject the erasure guidance into both the conditional and the unconditional noise prediction, enabling the model to effectively prevent the creation of erased concepts during both editing and generation. Furthermore, a stochastic correction guidance is introduced during training to address the erosion of unrelated concepts. We conducted erasure editing experiments with representative editing methods (i.e., LEDITS++ and MasaCtrl) to erase IP characters, and the results indicate that our ACE effectively filters out target concepts in both types of edits. Additional experiments on erasing explicit concepts and artistic styles further demonstrate that our ACE performs favorably against state-of-the-art methods. Our code will be publicly available at https://github.com/120L020904/ACE.
中文: 提出的抗编辑概念擦除(ACE)方法通过在条件和非条件噪声预测中融入擦除指导,有效防止有害或受版权保护内容的生成与编辑,在多种场景下优于现有技术。
English: The proposed Anti-Editing Concept Erasure (ACE) method effectively prevents the generation and editing of harmful or copyrighted content by integrating erasure guidance into both conditional and unconditional noise predictions, outperforming existing techniques in various scenarios.

Authors:Yun Zhu, Dong Zhang, Yi Lin, Yifei Feng, Jinhui Tang
Title: Merging Context Clustering with Visual State Space Models for Medical Image Segmentation
Abstract:
Medical image segmentation demands the aggregation of global and local feature representations, posing a challenge for current methodologies in handling both long-range and short-range feature interactions. Recently, vision mamba (ViM) models have emerged as promising solutions for addressing model complexities by excelling in long-range feature interactions with linear complexity. However, existing ViM approaches overlook the importance of preserving short-range local dependencies by directly flattening spatial tokens and are constrained by fixed scanning patterns that limit the capture of dynamic spatial context information. To address these challenges, we introduce a simple yet effective method named context clustering ViM (CCViM), which incorporates a context clustering module within the existing ViM models to segment image tokens into distinct windows for adaptable local clustering. Our method effectively combines long-range and short-range feature interactions, thereby enhancing spatial contextual representations for medical image segmentation tasks. Extensive experimental evaluations on diverse public datasets, i.e., Kumar, CPM17, ISIC17, ISIC18, and Synapse demonstrate the superior performance of our method compared to current state-of-the-art methods. Our code can be found at https://github.com/zymissy/CCViM.
中文: 提出的CCViM方法通过在视觉Mamba模型中引入上下文聚类模块,有效结合长程与短程特征交互,在多个数据集上显著提升了医学图像分割的空间上下文表征能力。
English: The proposed CCViM method enhances medical image segmentation by integrating a context clustering module into vision mamba models, effectively combining long-range and short-range feature interactions to improve spatial contextual representations across multiple datasets.

Authors:Yao Ding, Weijie Kang, Aitao Yang, Zhili Zhang, Junyang Zhao, Jie Feng, Danfeng Hong, Qinhe Zheng
Title: Adaptive Homophily Clustering: Structure Homophily Graph Learning with Adaptive Filter for Hyperspectral Image
Abstract:
Hyperspectral image (HSI) clustering has been a fundamental but challenging task with zero training labels. Currently, some deep graph clustering methods have been successfully explored for HSI due to their outstanding performance in effective spatial structural information encoding. Nevertheless, insufficient structural information utilization, poor feature representation ability, and weak graph update capability limit their performance. Thus, in this paper, a homophily structure graph learning with an adaptive filter clustering method (AHSGC) for HSI is proposed. Specifically, homogeneous region generation is first developed for HSI processing and constructing the original graph. Afterward, an adaptive filter graph encoder is designed to adaptively capture the high and low frequency features on the graph for subsequent processing. Then, a graph embedding clustering self-training decoder is developed with KL Divergence, with which the pseudo-label is generated for network training. Meanwhile, homophily-enhanced structure learning is introduced to update the graph according to the clustering task, in which the orient correlation estimation is adopted to estimate the node connection, and graph edge sparsification is designed to adjust the edges in the graph dynamically. Finally, a joint network optimization is introduced to achieve network self-training and update the graph. The K-means is adopted to express the latent features. Extensive experiments and repeated comparative analysis have verified that our AHSGC achieves high clustering accuracy, low computational complexity, and strong robustness. The source code will be available at https://github.com/DY-HYX.
中文: 本文提出了一种用于高光谱图像聚类的同质性结构图学习方法AHSGC,通过自适应图滤波和结构优化,显著提升了聚类精度和鲁棒性,同时降低了计算复杂度。
English: This paper introduces AHSGC, a novel homophily structure graph learning method with adaptive filtering for hyperspectral image clustering, which enhances feature representation and graph updating to achieve high accuracy, low complexity, and strong robustness.

Authors:Juliette Fenogli, Laurence Grimaud, Rodolphe Vuilleumier
Title: Constructing and explaining machine learning models for chemistry: example of the exploration and design of boron-based Lewis acids
Abstract:
The integration of machine learning (ML) into chemistry offers transformative potential in the design of molecules with targeted properties. However, the focus has often been on creating highly efficient predictive models, sometimes at the expense of interpretability. In this study, we leverage explainable AI techniques to explore the rational design of boron-based Lewis acids, which play a pivotal role in organic reactions due to their electron-accepting properties. Using Fluoride Ion Affinity as a proxy for Lewis acidity, we developed interpretable ML models based on chemically meaningful descriptors, including ab initio computed features and substituent-based parameters derived from the Hammett linear free-energy relationship. By constraining the chemical space to well-defined molecular scaffolds, we achieved highly accurate predictions (mean absolute error < 6 kJ/mol), surpassing conventional black-box deep learning models in low-data regimes. Interpretability analyses of the models shed light on the origin of Lewis acidity in these compounds and identified actionable levers to modulate it through the nature and positioning of substituents on the molecular scaffold. This work bridges ML and the chemist's way of thinking, demonstrating how explainable models can inspire molecular design and enhance scientific understanding of chemical reactivity.
Chinese: 本研究应用可解释人工智能技术开发了可解释的机器学习模型,用于设计硼基路易斯酸,不仅实现了高精度预测,还揭示了可操作的化学调控机制,成功搭建了计算方法与化学思维之间的桥梁。
English: This study applies explainable AI to develop interpretable machine learning models for designing boron-based Lewis acids, achieving high predictive accuracy and revealing actionable chemical insights that bridge computational methods with chemical reasoning.

Authors:Lihao Wang
Title: Click-Calib: A Robust Extrinsic Calibration Method for Surround-View Systems
Abstract:
Surround-View System (SVS) is an essential component in Advanced Driver Assistance System (ADAS) and requires precise calibrations. However, conventional offline extrinsic calibration methods are cumbersome and time-consuming as they rely heavily on physical patterns. Additionally, these methods primarily focus on short-range areas surrounding the vehicle, resulting in lower calibration quality in more distant zones. To address these limitations, we propose Click-Calib, a pattern-free approach for offline SVS extrinsic calibration. Without requiring any special setup, the user only needs to click a few keypoints on the ground in natural scenes. Unlike other offline calibration approaches, Click-Calib optimizes camera poses over a wide range by minimizing reprojection distance errors of keypoints, thereby achieving accurate calibrations at both short and long distances. Furthermore, Click-Calib supports both single-frame and multiple-frame modes, with the latter offering even better results. Evaluations on our in-house dataset and the public WoodScape dataset demonstrate its superior accuracy and robustness compared to baseline methods. Code is available at https://github.com/lwangvaleo/click_calib.
Chinese: Click-Calib是一种无需标定板的环视系统离线外参标定方法,用户只需在自然场景地面点击少量关键点,即可实现短距和长距的精确标定,在精度和鲁棒性上均优于基线方法。
English: Click-Calib is a pattern-free, offline extrinsic calibration method for Surround-View Systems that requires only a few user-clicked keypoints on the ground to achieve accurate calibration across both short and long distances, outperforming baseline methods in accuracy and robustness.

Authors:Ved G. Shah, Alex Gagliano, Konstantin Malanchev, Gautham Narayan, The LSST Dark Energy Science Collaboration
Title: ORACLE: A Real-Time, Hierarchical, Deep-Learning Photometric Classifier for the LSST
Abstract:
We present ORACLE, the first hierarchical deep-learning model for real-time, context-aware classification of transient and variable astrophysical phenomena. ORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), and has been trained using a custom hierarchical cross-entropy loss function to provide high-confidence classifications along an observationally-driven taxonomy with as little as a single photometric observation. Contextual information for each object, including host galaxy photometric redshift, offset, ellipticity and brightness, is concatenated to the light curve embedding and used to make a final prediction. Training on $\sim$0.5M events from the Extended LSST Astronomical Time-Series Classification Challenge, we achieve a top-level (Transient vs Variable) macro-averaged precision of 0.96 using only 1 day of photometric observations after the first detection in addition to contextual information, for each event; this increases to $>$0.99 once 64 days of the light curve has been obtained, and 0.83 at 1024 days after first detection for 19-way classification (including supernova sub-types, active galactic nuclei, variable stars, microlensing events, and kilonovae). We also compare ORACLE with other state-of-the-art classifiers and report comparable performance for the 19-way classification task, in addition to delivering accurate top-level classifications much earlier. The code and model weights used in this work are publicly available at our associated GitHub repository (https://github.com/uiucsn/ELAsTiCC-Classification).
中文: ORACLE是首个利用门控循环单元和上下文数据的分层深度学习模型,能够实时分类天体物理现象,仅需少量观测即可实现高精度识别,并在早期检测中优于其他分类器。
English: ORACLE is the first hierarchical deep-learning model using GRUs and contextual data to classify astrophysical phenomena in real time, achieving high precision with minimal observations and outperforming other classifiers in early-stage detection.

Authors:George Yuanji Wang, Srisharan Murugesan, Aditya Prince Rohatgi
Title: GAN-TAT: A Novel Framework Using Protein Interaction Networks in Druggable Gene Identification
Abstract:
Identifying druggable genes is essential for developing effective pharmaceuticals. With the availability of extensive, high-quality data, computational methods have become a significant asset. Protein Interaction Network (PIN) is valuable but challenging to implement due to its high dimensionality and sparsity. Previous methods relied on indirect integration, leading to resolution loss. This study proposes GAN-TAT, a framework utilizing an advanced graph embedding technology, ImGAGN, to directly integrate PIN for druggable gene inference work. Tested on three Pharos datasets, GAN-TAT achieved the highest AUC-ROC score of 0.951 on Tclin. Further evaluation shows that GAN-TAT's predictions are supported by clinical evidence, highlighting its potential practical applications in pharmacogenomics. This research represents a methodological attempt with the direct utilization of PIN, expanding potential new solutions for developing drug targets. The source code of GAN-TAT is available at (https://github.com/george-yuanji-wang/GAN-TAT).
中文: 本研究提出GAN-TAT框架,通过先进图嵌入技术直接整合蛋白质相互作用网络来精准识别可成药基因,在测试中表现优异,并展现出在药物基因组学中的实际应用潜力。
English: This study introduces GAN-TAT, a framework that directly integrates Protein Interaction Networks using advanced graph embedding to accurately identify druggable genes, achieving top performance in tests and demonstrating practical potential in pharmacogenomics.

Authors:Jingfeng Yao, Bin Yang, Xinggang Wang
Title: Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Abstract:
Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual tokenizers improves reconstruction quality, it requires substantially larger diffusion models and more training iterations to achieve comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within tokenizers or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty in learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35 while demonstrating remarkable training efficiency by reaching an FID score of 2.11 in just 64 epochs--representing an over 21 times convergence speedup compared to the original DiT. Models and codes are available at: https://github.com/hustvl/LightningDiT.
中文: 该研究揭示了潜在扩散模型中视觉分词器质量与计算效率之间的优化困境,提出VA-VAE方法将潜在空间与预训练视觉模型对齐,通过集成LightningDiT系统实现了最先进的图像生成性能,并显著加快了收敛速度。
English: The study identifies an optimization dilemma in latent diffusion models where improving visual tokenizer quality conflicts with computational efficiency, and proposes VA-VAE to align latent spaces with pre-trained vision models, achieving state-of-the-art image generation with significantly faster convergence through the integrated LightningDiT system.

Authors:Xudong Jiang, Fangjinhua Wang, Silvano Galliani, Christoph Vogel, Marc Pollefeys
Title: R-SCoRe: Revisiting Scene Coordinate Regression for Robust Large-Scale Visual Localization
Abstract:
Learning-based visual localization methods that use scene coordinate regression (SCR) offer the advantage of smaller map sizes. However, on datasets with complex illumination changes or image-level ambiguities, they remain a less robust alternative to feature matching methods. This work aims to close the gap. We introduce a covisibility graph-based global encoding learning and data augmentation strategy, along with a depth-adjusted reprojection loss to facilitate implicit triangulation. Additionally, we revisit the network architecture and local feature extraction module. Our method achieves state-of-the-art results on challenging large-scale datasets without relying on network ensembles or 3D supervision. On Aachen Day-Night, we are 10$\times$ more accurate than previous SCR methods with similar map sizes and require at least 5$\times$ smaller map sizes than any other SCR method while still delivering superior accuracy. Code is available at: https://github.com/cvg/scrstudio.
中文: 本研究通过引入基于共视图的全局编码学习、数据增强策略及深度调整重投影损失,提升了场景坐标回归在视觉定位中的性能,在复杂数据集上以更小的地图尺寸实现了顶尖的精度。
English: This study enhances scene coordinate regression for visual localization by introducing a covisibility graph-based global encoding, data augmentation, and a depth-adjusted reprojection loss, achieving state-of-the-art accuracy with significantly reduced map sizes on challenging datasets.

Authors:Yoshitomo Matsubara, Matteo Mendula, Marco Levorato
Title: A Multi-task Supervised Compression Model for Split Computing
Abstract:
Split computing ($\neq$ split learning) is a promising approach to deep learning models for resource-constrained edge computing systems, where weak sensor (mobile) devices are wirelessly connected to stronger edge servers through channels with limited communication capacity. State-of-the-art work on split computing presents methods for single tasks such as image classification, object detection, or semantic segmentation. The application of existing methods to multi-task problems degrades model accuracy and/or significantly increases runtime latency. In this study, we propose Ladon, the first multi-task-head supervised compression model for multi-task split computing. Experimental results show that the multi-task supervised compression model either outperformed or rivaled strong lightweight baseline models in terms of predictive performance for ILSVRC 2012, COCO 2017, and PASCAL VOC 2012 datasets while learning compressed representations at its early layers. Furthermore, our models reduced end-to-end latency (by up to 95.4%) and energy consumption of mobile devices (by up to 88.2%) in multi-task split computing scenarios.
中文: Ladon是首个用于分割计算的多任务监督压缩模型,它在多个数据集上提升预测性能的同时,大幅降低了移动设备的延迟和能耗。
English: Ladon is the first multi-task supervised compression model for split computing that enhances predictive performance on multiple datasets while significantly reducing latency and energy consumption on mobile devices.

Authors:Yidi Shao, Chen Change Loy, Bo Dai
Title: Learning 3D Garment Animation from Trajectories of A Piece of Cloth
Abstract:
Garment animation is ubiquitous in various applications, such as virtual reality, gaming, and film production. Recently, learning-based approaches have obtained compelling performance in animating diverse garments under versatile scenarios. Nevertheless, to mimic the deformations of the observed garments, data-driven methods require large-scale garment data, which is both expensive and time-consuming to collect. In addition, forcing models to match the dynamics of observed garment animation may hinder the potential to generalize to unseen cases. In this paper, instead of using garment-wise supervised learning, we adopt a disentangled scheme to learn how to animate observed garments: (1) learning constitutive behaviors from the observed cloth; (2) dynamically animating various garments constrained by the learned constitutive laws. Specifically, we propose Energy Unit network (EUNet) to model the constitutive relations in the format of energy. Without the priors from analytical physics models and differentiable simulation engines, EUNet is able to directly capture the constitutive behaviors from the observed piece of cloth and uniformly describe the change of energy caused by deformations, such as stretching and bending. We further apply the pre-trained EUNet to animate various garments based on energy optimizations. The disentangled scheme alleviates the need for garment data and enables us to utilize the dynamics of a piece of cloth for animating garments. Experiments show that while EUNet effectively delivers the energy gradients due to the deformations, models constrained by EUNet achieve more stable and physically plausible performance compared with those trained in a garment-wise supervised manner. Code is available at https://github.com/ftbabi/EUNet_NeurIPS2024.git.
Chinese: 本文提出EUNet这一解耦学习方案,通过能量动态模拟服装本构行为,无需大量服装专用数据即可实现稳定且符合物理规律的动画效果。
English: This paper introduces EUNet, a disentangled learning approach that models garment constitutive behaviors from energy dynamics, enabling stable and physically plausible animation without requiring extensive garment-specific data.

Authors:Xuyin Qi, Zeyu Zhang, Aaron Berliano Handoko, Huazhan Zheng, Mingxi Chen, Ta Duc Huy, Vu Minh Hieu Phan, Lei Zhang, Linqi Cheng, Shiyu Jiang, Zhiwei Zhang, Zhibin Liao, Yang Zhao, Minh-Son To
Title: ProjectedEx: Enhancing Generation in Explainable AI for Prostate Cancer
Abstract:
Prostate cancer, a growing global health concern, necessitates precise diagnostic tools, with Magnetic Resonance Imaging (MRI) offering high-resolution soft tissue imaging that significantly enhances diagnostic accuracy. Recent advancements in explainable AI and representation learning have significantly improved prostate cancer diagnosis by enabling automated and precise lesion classification. However, existing explainable AI methods, particularly those based on frameworks like generative adversarial networks (GANs), are predominantly developed for natural image generation, and their application to medical imaging often leads to suboptimal performance due to the unique characteristics and complexity of medical images. To address these challenges, our paper introduces three key contributions. First, we propose ProjectedEx, a generative framework that provides interpretable, multi-attribute explanations, effectively linking medical image features to classifier decisions. Second, we enhance the encoder module by incorporating feature pyramids, which enables multiscale feedback to refine the latent space and improves the quality of generated explanations. Third, we conduct comprehensive experiments on both the generator and classifier, demonstrating the clinical relevance and effectiveness of ProjectedEx in enhancing interpretability and supporting the adoption of AI in medical settings. Code will be released at https://github.com/Richardqiyi/ProjectedEx
中文: 本文提出了ProjectedEx这一可解释AI框架,通过生成多属性医学图像解释并整合特征金字塔技术,显著提升了前列腺癌诊断的临床可解释性和实用性。
English: This paper introduces ProjectedEx, an explainable AI framework that enhances prostate cancer diagnosis by generating multi-attribute medical image explanations and incorporating feature pyramids for improved interpretability in clinical settings.

Authors:Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua
Title: Aligning Large Language Models for Faithful Integrity Against Opposing Argument
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimates the model's confidence in the question based on the internal states during decoding as well as in the answer based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via https://github.com/zhaoy777/AFICE.git
中文摘要:AFICE框架通过双边置信度估计和直接偏好优化,增强大语言模型在遇到对立论点时保持忠实完整性的能力,确保其回应的可靠性和一致性。
English Summary: The AFICE framework enhances large language models' ability to maintain faithful integrity by using bilateral confidence estimation and direct preference optimization to ensure consistent responses despite opposing arguments.
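
As a reading aid for the alignment step mentioned above, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is not taken from the AFICE code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All four inputs are log-probabilities of whole responses (assumed inputs).
    beta scales how strongly the policy is pushed away from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical numbers: the policy already prefers the chosen response.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss shrinks as the policy prefers the chosen response more strongly than the reference does.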

Authors:Leandro Di Bella, Yangxintong Lyu, Bruno Cornelis, Adrian Munteanu
Title: HybridTrack: A Hybrid Approach for Robust Multi-Object Tracking
Abstract:
The evolution of Advanced Driver Assistance Systems (ADAS) has increased the need for robust and generalizable algorithms for multi-object tracking. Traditional statistical model-based tracking methods rely on predefined motion models and assumptions about system noise distributions. Although computationally efficient, they often lack adaptability to varying traffic scenarios and require extensive manual design and parameter tuning. To address these issues, we propose a novel 3D multi-object tracking approach for vehicles, HybridTrack, which integrates a data-driven Kalman Filter (KF) within a tracking-by-detection paradigm. In particular, it learns the transition residual and Kalman gain directly from data, which eliminates the need for manual motion and stochastic parameter modeling. Validated on the real-world KITTI dataset, HybridTrack achieves 82.72% HOTA accuracy, significantly outperforming state-of-the-art methods. We also evaluate our method under different configurations, achieving the fastest processing speed of 112 FPS. Consequently, HybridTrack eliminates the dependency on scene-specific designs while improving performance and maintaining real-time efficiency. The code is publicly available at: https://github.com/leandro-svg/HybridTrack.
Chinese: HybridTrack提出了一种创新的3D多目标跟踪方法,通过集成数据驱动的卡尔曼滤波器消除了手动运动建模需求,在KITTI数据集上实现了卓越的准确性和实时处理能力。
English: HybridTrack introduces a novel 3D multi-object tracking method that integrates a data-driven Kalman Filter to eliminate manual motion modeling, achieving superior accuracy and real-time performance on the KITTI dataset.
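
For readers unfamiliar with the filter that HybridTrack builds on, the following is a minimal sketch of one predict/update step of a classical scalar Kalman filter. It is not the authors' implementation: the function name `kalman_step`, the scalar state, and the optional `transition_residual` and `learned_gain` arguments are hypothetical stand-ins for the quantities that the abstract says are learned from data.

```python
def kalman_step(x, p, z, q, r, transition_residual=0.0, learned_gain=None):
    """One predict/update step of a scalar Kalman filter.

    x, p : previous state estimate and its variance
    z    : new measurement of the state
    q, r : process and measurement noise variances
    transition_residual : hypothetical correction added to the motion model
    learned_gain        : hypothetical gain that replaces the analytic one
    """
    # Predict: identity motion model plus an optional residual term.
    x_pred = x + transition_residual
    p_pred = p + q

    # Gain: analytic Kalman gain unless a gain is supplied from outside.
    k = p_pred / (p_pred + r) if learned_gain is None else learned_gain

    # Update: blend the prediction with the measurement.
    x_new = x_pred + k * (z - x_pred)
    # Joseph form of the variance update, valid for any gain k.
    p_new = (1.0 - k) ** 2 * p_pred + k ** 2 * r
    return x_new, p_new, k


# Equal prior and measurement variance: the estimate lands halfway.
print(kalman_step(0.0, 1.0, 2.0, 0.0, 1.0))
```

In the classical filter `q` and `r` must be specified by hand; the optional arguments only mark where a data-driven model could plug in its own prediction correction and gain.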

Authors:Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng
Title: ProgCo: Program Helps Self-Correction of Large Language Models
Abstract:
Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe, conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction, and can further enhance performance when combined with real program tools. We release our code at https://github.com/songxiaoshuai/progco.
Chinese: 大语言模型的自校正通过ProgCo得以增强,它利用自生成的验证伪程序来验证和优化响应,从而在复杂推理任务中提升性能。
English: Self-correction in large language models is enhanced by ProgCo, which uses self-generated verification pseudo-programs to verify and refine responses, improving performance in complex reasoning tasks.

Authors:Yongle Huang, Haodong Chen, Zhenbang Xu, Zihan Jia, Haozhou Sun, Dian Shao
Title: SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
Abstract:
Human action understanding is crucial for the advancement of multimodal systems. While recent developments, driven by powerful large language models (LLMs), aim to be general enough to cover a wide range of categories, they often overlook the need for more specific capabilities. In this work, we address the more challenging task of Fine-grained Action Recognition (FAR), which focuses on detailed semantic labels within shorter temporal duration (e.g., "salto backward tucked with 1 turn"). Given the high costs of annotating fine-grained labels and the substantial data needed for fine-tuning LLMs, we propose to adopt semi-supervised learning (SSL). Our framework, SeFAR, incorporates several innovative designs to tackle these challenges. Specifically, to capture sufficient visual details, we construct Dual-level temporal elements as more effective representations, based on which we design a new strong augmentation strategy for the Teacher-Student learning paradigm through involving moderate temporal perturbation. Furthermore, to handle the high uncertainty within the teacher model's predictions for FAR, we propose the Adaptive Regulation to stabilize the learning process. Experiments show that SeFAR achieves state-of-the-art performance on two FAR datasets, FineGym and FineDiving, across various data scopes. It also outperforms other semi-supervised methods on two classical coarse-grained datasets, UCF101 and HMDB51. Further analysis and ablation studies validate the effectiveness of our designs. Additionally, we show that the features extracted by our SeFAR could largely promote the ability of multimodal foundation models to understand fine-grained and domain-specific semantics.
Chinese: 本文提出SeFAR框架,通过双级时间元素和自适应调节的半监督学习方法,在细粒度动作识别任务中实现了最优性能,并有效提升了多模态基础模型的领域语义理解能力。
English: This paper introduces SeFAR, a semi-supervised learning framework designed for Fine-grained Action Recognition that incorporates dual-level temporal elements and adaptive regulation to achieve state-of-the-art performance on specialized datasets.

Authors:Zhiyao Wang, Xu Chen, Chengming Xu, Junwei Zhu, Xiaobin Hu, Jiangning Zhang, Chengjie Wang, Yuqi Liu, Yiyi Zhou, Rongrong Ji
Title: SVFR: A Unified Framework for Generalized Video Face Restoration
Abstract:
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed stable video face restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage the shared feature representation learning among different subtasks. To further enhance the restoration quality and temporal stability, we introduce the facial prior learning and the self-referred refinement as auxiliary strategies used for both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration. Code and video demo are available at https://github.com/wangzhiyaoo/SVFR.git.
中文: 本文提出了一种新颖的通用视频人脸修复框架,通过整合修复与着色等多任务,利用稳定视频扩散先验和统一潜在正则化方法,显著提升了视频修复的时间一致性与质量。
English: This paper introduces a novel generalized video face restoration framework that integrates multiple tasks like inpainting and colorization using Stable Video Diffusion priors, enhancing temporal consistency and restoration quality through unified latent regularization and auxiliary strategies.

Authors:Amil Bhagat, Milind Jain, A. V. Subramanyam
Title: Conditional Consistency Guided Image Translation and Enhancement
Abstract:
Consistency models have emerged as a promising alternative to diffusion models, offering high-quality generative capabilities through single-step sample generation. However, their application to multi-domain image translation tasks, such as cross-modal translation and low-light image enhancement, remains largely unexplored. In this paper, we introduce Conditional Consistency Models (CCMs) for multi-domain image translation by incorporating additional conditional inputs. We implement these modifications by introducing task-specific conditional inputs that guide the denoising process, ensuring that the generated outputs retain structural and contextual information from the corresponding input domain. We evaluate CCMs on 10 different datasets, demonstrating their effectiveness in producing high-quality translated images across multiple domains. Code is available at https://github.com/amilbhagat/Conditional-Consistency-Models.
中文: 条件一致性模型通过引入任务特定的条件输入,实现了跨多个领域的高质量单步图像翻译,并在10个不同数据集上验证了其有效性。
English: Conditional Consistency Models (CCMs) introduce task-specific conditional inputs to enable high-quality, single-step image translation across multiple domains, demonstrating effectiveness on 10 diverse datasets.

Authors:Yitong Zhu, Zhuowen Liang, Yiming Wu, Tangyao Li, Yuyang Wang
Title: Towards Consumer-Grade Cybersickness Prediction: Multi-Model Alignment for Real-Time Vision-Only Inference
Abstract:
Cybersickness remains a major obstacle to the widespread adoption of immersive virtual reality (VR), particularly in consumer-grade environments. While prior methods rely on invasive signals such as electroencephalography (EEG) for high predictive accuracy, these approaches require specialized hardware and are impractical for real-world applications. In this work, we propose a scalable, deployable framework for personalized cybersickness prediction leveraging only non-invasive signals readily available from commercial VR headsets, including head motion, eye tracking, and physiological responses. Our model employs a modality-specific graph neural network enhanced with a Difference Attention Module to extract temporal-spatial embeddings capturing dynamic changes across modalities. A cross-modal alignment module jointly trains the video encoder to learn personalized traits by aligning video features with sensor-derived representations. Consequently, the model accurately predicts individual cybersickness using only video input during inference. Experimental results show our model achieves 88.4% accuracy, closely matching EEG-based approaches (89.16%), while reducing deployment complexity. With an average inference latency of 90ms, our framework supports real-time applications, ideal for integration into consumer-grade VR platforms without compromising personalization or performance. The code will be released at https://github.com/U235-Aurora/PTGNN.
中文摘要:本研究提出一种利用商用VR头显非侵入式信号的个性化晕动症预测框架,通过新型图神经网络设计实现接近脑电图方法的准确率,并具备实时应用能力。
English Summary: This study introduces a scalable framework for predicting cybersickness in VR using non-invasive signals from commercial headsets, achieving near-EEG accuracy with real-time performance through a novel graph neural network design.

Authors:Dat Nguyen, Marcella Astrid, Anis Kacem, Enjie Ghorbel, Djamila Aouada
Title: Vulnerability-Aware Spatio-Temporal Learning for Generalizable Deepfake Video Detection
Abstract:
Detecting deepfake videos is highly challenging given the complexity of characterizing spatio-temporal artifacts. Most existing methods rely on binary classifiers trained using real and fake image sequences, therefore hindering their generalization capabilities to unseen generation methods. Moreover, with the constant progress in generative Artificial Intelligence (AI), deepfake artifacts are becoming imperceptible at both the spatial and the temporal levels, making them extremely difficult to capture. To address these issues, we propose a fine-grained deepfake video detection approach called FakeSTormer that enforces the modeling of subtle spatio-temporal inconsistencies while avoiding overfitting. Specifically, we introduce a multi-task learning framework that incorporates two auxiliary branches for explicitly attending artifact-prone spatial and temporal regions. Additionally, we propose a video-level data synthesis strategy that generates pseudo-fake videos with subtle spatio-temporal artifacts, providing high-quality samples and hand-free annotations for our additional branches. Extensive experiments on several challenging benchmarks demonstrate the superiority of our approach compared to recent state-of-the-art methods. The code is available at https://github.com/10Ring/FakeSTormer.
Chinese: FakeSTormer方法通过多任务学习框架和视频级数据合成技术,有效捕捉深度伪造视频中的细微时空不一致性,在多个基准测试中优于现有方法。
English: The proposed FakeSTormer method enhances deepfake video detection by employing a multi-task learning framework and video-level data synthesis to capture subtle spatio-temporal inconsistencies, outperforming existing approaches on multiple benchmarks.

Authors:Lixiong Qin, Ning Jiang, Yang Zhang, Yuhan Qiu, Dingheng Zeng, Jiani Hu, Weihong Deng
Title: Towards Interactive Deepfake Analysis
Abstract:
Existing deepfake analysis methods are primarily based on discriminative models, which significantly limits their application scenarios. This paper aims to explore interactive deepfake analysis by performing instruction tuning on multi-modal large language models (MLLMs). This faces challenges such as the lack of datasets and benchmarks, and low training efficiency. To address these issues, we introduce (1) a GPT-assisted data construction process resulting in an instruction-following dataset called DFA-Instruct, (2) a benchmark named DFA-Bench, designed to comprehensively evaluate the capabilities of MLLMs in deepfake detection, deepfake classification, and artifact description, and (3) an interactive deepfake analysis system called DFA-GPT, built with the Low-Rank Adaptation (LoRA) module as a strong baseline for the community. The dataset and code will be made available at https://github.com/lxq1000/DFA-Instruct to facilitate further research.
中文摘要:本文通过指令调优多模态大语言模型探索交互式深度伪造分析,构建了DFA-Instruct数据集和DFA-Bench基准,并提出基于LoRA的DFA-GPT系统作为解决方案。
English Summary: This paper introduces an interactive deepfake analysis system using instruction-tuned multimodal large language models, addressing dataset gaps with DFA-Instruct and DFA-Bench while proposing DFA-GPT as a baseline with LoRA adaptation.

Authors:Anugunj Naman, Aaron Ault, Yaguang Zhang, James Krogmeier
Title: Automating Work Orders and Tracking Winter Snow Plows and Patrol Vehicles with Telematics Data
Abstract:
Winter road maintenance is a critical priority for the Indiana Department of Transportation, which manages an extensive fleet across thousands of lane miles. The current manual tracking of snowplow workloads is inefficient and prone to errors. To address these challenges, we developed an in-browser web application that automates the creation and verification of work orders using a large-scale GPS dataset from telematics systems. The application processes millions of GPS data points from hundreds of vehicles over winter, significantly reducing manual labor and minimizing errors. Key features include geohashing for efficient road segment identification, detailed segment-level work records, and robust visualization of vehicle movements, even on repeated routes. Our proposed solution has the potential to enhance the accuracy and granularity of work records, support more effective resource allocation, ensure timely compensation for drivers, alleviate administrative burdens, and allow managers to focus on strategic planning and real-time challenges. The web application can be accessed at https://github.com/oats-center/arrtrack/
中文: 该网络应用利用GPS数据自动生成和验证扫雪车工作单,有效提升印第安纳州交通部门冬季道路维护效率并减少人工错误。
English: This web application automates snowplow work order creation and verification using GPS data to improve efficiency and reduce errors in winter road maintenance for Indiana's transportation department.
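
The abstract above names geohashing as the mechanism for efficient road segment identification. The following is a minimal sketch of the standard geohash encoding, to show why nearby GPS points end up with a shared string prefix. It is not taken from the project's code: the function name `geohash_encode` and the default precision are assumptions made for illustration.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # bits accumulated for the current character
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, 5))
```

Because each extra character refines the same cell, grouping GPS points by a fixed-length geohash prefix is a cheap way to bucket them into grid cells before matching them to road segments.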

Authors:Feng Han, Kai Chen, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Title: DuMo: Dual Encoder Modulation Network for Precise Concept Erasure
Abstract:
The exceptional generative capability of text-to-image models has raised substantial safety concerns regarding the generation of Not-Safe-For-Work (NSFW) content and potential copyright infringement. To address these concerns, previous methods safeguard the models by eliminating inappropriate concepts. Nonetheless, these methods alter the parameters of the backbone network and exert considerable influence on the structural (low-frequency) components of the image, which undermines the model's ability to retain non-target concepts. In this work, we propose our Dual encoder Modulation network (DuMo), which achieves precise erasure of inappropriate target concepts with minimum impairment to non-target concepts. In contrast to previous methods, DuMo employs the Eraser with PRior Knowledge (EPR) module which modifies the skip connection features of the U-NET and primarily achieves concept erasure on the detail (high-frequency) components of the image. To minimize the damage to non-target concepts during erasure, the parameters of the backbone U-NET are frozen and the prior knowledge from the original skip connection features is introduced to the erasure process. Meanwhile, we observe that the EPR exhibits distinct erasing preferences for image structure and details at different timesteps and layers. Therefore, we adopt a novel Time-Layer MOdulation process (TLMO) that adjusts the erasure scale of EPR module's outputs across different layers and timesteps, automatically balancing the erasure effects and model's generative ability. Our method achieves state-of-the-art performance on Explicit Content Erasure, Cartoon Concept Removal and Artistic Style Erasure, clearly outperforming alternative methods. Code is available at https://github.com/Maplebb/DuMo
中文: 提出的DuMo网络通过修改图像高频细节实现精准内容擦除,同时利用TLMO模块自适应调节擦除强度,在保护非目标概念的前提下取得了最优性能。
English: The proposed DuMo network effectively removes inappropriate content from images by modifying high-frequency details while preserving non-target concepts, achieving superior performance through its innovative EPR and TLMO modules.

Authors:Jian Lang, Zhangtao Cheng, Ting Zhong, Fan Zhou
Title: Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning
Abstract:
Multimodal learning with incomplete modality is practical and challenging. Recently, researchers have focused on enhancing the robustness of pre-trained MultiModal Transformers (MMTs) under missing modality conditions by applying learnable prompts. However, these prompt-based methods face several limitations: (1) incomplete modalities provide restricted modal cues for task-specific inference, (2) dummy imputation for missing content causes information loss and introduces noise, and (3) static prompts are instance-agnostic, offering limited knowledge for instances with various missing conditions. To address these issues, we propose RAGPT, a novel Retrieval-AuGmented dynamic Prompt Tuning framework. RAGPT comprises three modules: (I) the multi-channel retriever, which identifies similar instances through a within-modality retrieval strategy, (II) the missing modality generator, which recovers missing information using retrieved contexts, and (III) the context-aware prompter, which captures contextual knowledge from relevant instances and generates dynamic prompts to largely enhance the MMT's robustness. Extensive experiments conducted on three real-world datasets show that RAGPT consistently outperforms all competitive baselines in handling incomplete modality problems. The code of our work and prompt-based baselines is available at https://github.com/Jian-Lang/RAGPT.
中文摘要:提出的RAGPT框架通过检索增强的动态提示解决了多模态学习中模态缺失的难题,利用模态内检索和动态提示生成显著提升了多模态变换器的鲁棒性,在三个真实数据集上均展现出最优性能。
English Summary: The proposed RAGPT framework addresses limitations in prompt-based multimodal learning by using retrieval-augmented dynamic prompts to enhance transformer robustness against missing modalities, demonstrating superior performance across three real-world datasets.

Authors:Jimin Park, AHyun Ji, Minji Park, Mohammad Saidur Rahman, Se Eun Oh
Title: MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification
Abstract:
Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Generative Replay (GR)-based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model's performance on old data degrades over time. In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement innovative selection schemes for replay samples based on the model's hidden representations. Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario -- where new classes are introduced continuously over multiple tasks -- demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems. The implementation is available at https://github.com/MalwareReplayGAN/MalCL (the code will be made public upon the presentation of the paper).
中文: 本文提出了一种基于生成回放的持续学习系统,采用带特征匹配损失的GAN和创新回放样本选择方案,在Windows恶意软件的类增量学习场景中实现了55%的平均准确率,比先前方法性能提升28%。
English: This paper introduces a Generative Replay-based Continual Learning system using GANs with feature matching loss and innovative replay selection schemes, achieving a 55% average accuracy on Windows malware and outperforming previous methods by 28% in class-incremental learning scenarios.

Authors:Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen
Title: MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
Abstract:
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codec, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes residual linear projection structure for Mel spectrum quantization to enhance the stability and efficiency of target extraction and lead to better performance. Experiments in a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open source in https://github.com/tencent-ailab/MuQ.
Chinese: MuQ模型采用梅尔残差向量量化进行自监督音乐表示学习,仅用少量预训练数据即在多项任务中表现卓越,并能通过数据扩展持续提升性能。
English: The MuQ model introduces a self-supervised music representation learning approach using Mel Residual Vector Quantization, achieving superior performance across multiple tasks with minimal pre-training data and scaling effectively to larger datasets.

Authors:Shuo Yu, Shan Jin, Ming Li, Tabinda Sarwar, Feng Xia
Title: Long-range Brain Graph Transformer
Abstract:
Understanding communication and information processing among brain regions of interest (ROIs) is highly dependent on long-range connectivity, which plays a crucial role in facilitating diverse functional neural integration across the entire brain. However, previous studies generally focused on the short-range dependencies within brain networks while neglecting the long-range dependencies, limiting an integrated understanding of brain-wide communication. To address this limitation, we propose Adaptive Long-range aware TransformER (ALTER), a brain graph transformer to capture long-range dependencies between brain ROIs utilizing biased random walk. Specifically, we present a novel long-range aware strategy to explicitly capture long-range dependencies between brain ROIs. By guiding the walker towards the next hop with higher correlation value, our strategy simulates the real-world brain-wide communication. Furthermore, by employing the transformer framework, ALTER adaptively integrates both short- and long-range dependencies between brain ROIs, enabling an integrated understanding of multi-level communication across the entire brain. Extensive experiments on ABIDE and ADNI datasets demonstrate that ALTER consistently outperforms generalized state-of-the-art graph learning methods (including SAN, Graphormer, GraphTrans, and LRGNN) and other graph learning based brain network analysis methods (including FBNETGEN, BrainNetGNN, BrainGNN, and BrainNETTF) in neurological disease diagnosis. Cases of long-range dependencies are also presented to further illustrate the effectiveness of ALTER. The implementation is available at https://github.com/yushuowiki/ALTER.
中文: 本研究提出ALTER,一种通过偏置随机游走捕捉大脑区域间长程依赖性的脑图转换器,在神经疾病诊断中优于现有方法。
English: This study introduces ALTER, a brain graph transformer that captures long-range dependencies between brain regions using biased random walks, outperforming existing methods in neurological disease diagnosis.
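The biased random walk mentioned in the abstract above can be illustrated with a small sketch. This is not the ALTER implementation: the function name `biased_random_walk`, the use of absolute correlation values as transition weights, and the exclusion of self-loops are all assumptions made for illustration. The abstract only states that the walker is guided towards next hops with higher correlation.

```python
import random

def biased_random_walk(corr, start, length, rng=None):
    """Walk over ROIs, choosing each next hop with probability proportional
    to the absolute correlation with the current ROI (illustrative assumption)."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    walk = [start]
    current = start
    for _ in range(length):
        # Weight every other ROI by |correlation|; never stay in place.
        weights = [abs(c) if j != current else 0.0
                   for j, c in enumerate(corr[current])]
        if sum(weights) == 0:
            break  # isolated ROI: nowhere to go
        current = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        walk.append(current)
    return walk

# Toy correlation matrix in which each ROI has exactly one correlated neighbour.
toy_corr = [
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]
toy_walk = biased_random_walk(toy_corr, start=0, length=4)
```

With the toy matrix every step has only one admissible neighbour, so the walk cycles through the three ROIs. On a real correlation matrix, strongly correlated ROIs are simply visited more often.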

Authors:Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, Yu Zhang
Title: EliGen: Entity-Level Controlled Image Generation with Regional Attention
Abstract:
Recent advancements in diffusion models have significantly advanced text-to-image generation, yet global text prompts alone remain insufficient for achieving fine-grained control over individual entities within an image. To address this limitation, we present EliGen, a novel framework for Entity-level controlled image Generation. Firstly, we put forward regional attention, a mechanism for diffusion transformers that requires no additional parameters, seamlessly integrating entity prompts and arbitrary-shaped spatial masks. By contributing a high-quality dataset with fine-grained spatial and semantic entity-level annotations, we train EliGen to achieve robust and accurate entity-level manipulation, surpassing existing methods in both spatial precision and image quality. Additionally, we propose an inpainting fusion pipeline, extending its capabilities to multi-entity image inpainting tasks. We further demonstrate its flexibility by integrating it with other open-source models such as IP-Adapter, In-Context LoRA and MLLM, unlocking new creative possibilities. The source code, model, and dataset are published at https://github.com/modelscope/DiffSynth-Studio.git.
中文: EliGen提出了一种新颖框架,通过区域注意力和高质量数据集实现精细的实体级图像控制,在精度和质量上超越现有方法,并能扩展到多实体任务及与其他模型集成。
English: EliGen introduces a novel framework with regional attention and a high-quality dataset to enable fine-grained entity-level image control, surpassing existing methods in precision and quality while extending to multi-entity tasks and integration with other models.

Authors:Xiaohui Chen, Yinkai Wang, Jiaxing He, Yuanqi Du, Soha Hassoun, Xiaolin Xu, Li-Ping Liu
Title: Graph Generative Pre-trained Transformer
Abstract:
Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node set and edge set. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction. Code is available at https://github.com/tufts-ml/G2PT.
Chinese: 本研究提出了图生成预训练变换器(G2PT),该自回归模型采用新颖的序列化图表示方法,在分子设计和性质预测等下游任务中展现出卓越的生成性能和强大的适应能力。
English: This work introduces the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that uses a novel sequence-based graph representation to achieve superior generative performance and strong adaptability in downstream tasks like molecular design and property prediction.

Authors:Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir
Title: BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion
Abstract:
Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
Chinese: 本研究提出了一种新颖的上下文多输入特征融合方法MultiGen,用于孟加拉语宗教新闻标题生成,通过整合情感、类别和方面特征,在BLEU和ROUGE-L得分上优于仅基于内容的方法。
English: This study introduces a novel contextual multi-input feature fusion approach, MultiGen, for Bengali religious news headline generation, which outperforms content-only methods by integrating sentiment, category, and aspect features, achieving superior BLEU and ROUGE-L scores.

Authors:Youngjun Son, Chaewon Kim, Jaejin Lee
Title: FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration
Abstract:
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of large language models. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 times compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
中文: FED框架通过GPU优化的MinHash LSH和高效哈希函数显著加速数据集去重,在保持高质量去重的同时,处理速度比CPU工具快107.2倍,比GPU方案快6.3倍。
English: The FED framework significantly accelerates dataset deduplication using GPU-optimized MinHash LSH and efficient hash functions, achieving up to 107.2x speed over CPU tools and 6.3x over GPU alternatives while maintaining high deduplication quality.
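For readers unfamiliar with MinHash LSH, the algorithm the abstract above builds on, the following is a minimal CPU-only sketch of the standard technique. It is not FED's GPU implementation and does not use FED's partially reusable non-cryptographic hash functions. Salted BLAKE2b from the standard library is swapped in purely for determinism, and all function names and parameters (`shingles`, `minhash_signature`, `band_keys`, `num_perm`, `bands`) are hypothetical.

```python
import hashlib

def shingles(text, k=5):
    """Return the set of character k-grams of a document."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)

def band_keys(signature, bands=32):
    """LSH banding: documents sharing any (band index, rows) key become candidates."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumped over the lazy dogs")
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
estimated = estimate_jaccard(sig_a, sig_b)
candidates = bool(set(band_keys(sig_a)) & set(band_keys(sig_b)))
```

Signature generation, the loop over `num_perm` hash functions per document, is the phase the abstract reports as the dominant cost, which is why it is the natural target for GPU acceleration.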

Authors:Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, AiTi Aw
Title: Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
Abstract:
Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our model's adaptability to Singlish context, achieving state-of-the-art performance and outperforming prior models by 10-30% in comparison with other AudioLLMs and cascaded solutions.
中文摘要:本研究推出了最大的标准化标注口语新加坡英语语料库——多任务国家语音语料库(MNSC),并提出了多任务多模态模型SingAudioLLM,该模型在多项语音处理任务中表现优异,性能超越先前模型10-30%,达到当前最优水平。
English Summary: This study introduces the largest standardized and annotated spoken Singlish corpus, the Multitask National Speech Corpus (MNSC), along with a multi-task multimodal model called SingAudioLLM, which achieves state-of-the-art performance by outperforming previous models by 10-30% across various speech processing tasks.

Authors:Ziyang Chen, Wenting Li, Yongjun Zhang, Yabo Wu, Bingshu Wang, Yong Zhao, C. L. Philip Chen
Title: Hadamard Attention Recurrent Transformer: A Strong Baseline for Stereo Matching Transformer
Abstract:
Constrained by the low-rank bottleneck inherent in attention mechanisms, current stereo matching transformers suffer from limited nonlinear expressivity, which renders their feature representations sensitive to challenging conditions such as reflections. To overcome this difficulty, we present the Hadamard Attention Recurrent Stereo Transformer (HART). HART includes a novel attention mechanism that incorporates the following components: 1) The Dense Attention Kernel (DAK) maps the attention weight distribution into a high-dimensional space over $(0, +\infty)$. By removing the upper bound constraint on attention weights, DAK enables more flexible modeling of complex feature interactions. This reduces feature collinearity. 2) The Multi Kernel & Order Interaction (MKOI) module extends the attention mechanism by unifying semantic and spatial knowledge learning. This integration improves the ability of HART to learn features in binocular images. Experimental results demonstrate the effectiveness of our HART. In reflective areas, HART ranked 1st on the KITTI 2012 benchmark among all published methods at the time of submission. Code is available at https://github.com/ZYangChen/HART.
中文: 现有立体匹配变换器因注意力机制的低秩瓶颈而存在非线性表达能力受限的问题,对反射等挑战性条件敏感,但提出的HART模型通过移除注意力权重上限约束并融合语义-空间学习的新型注意力机制,在KITTI 2012基准测试中取得了最优性能。
English: Current stereo matching transformers face limited nonlinear expressivity due to low-rank attention bottlenecks, making them sensitive to challenging conditions like reflections, but the proposed HART model overcomes this with a novel attention mechanism that removes upper bound constraints and integrates semantic-spatial learning, achieving top performance on the KITTI 2012 benchmark.

Authors:Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
Title: FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Abstract:
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction compared to compiler backends on LLM serving benchmarks, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
Chinese: FlashInfer 是一个用于大语言模型服务的可定制高效注意力引擎,通过块稀疏 KV 缓存存储、即时编译和负载均衡调度优化内存访问与性能,在各种推理场景中显著降低了延迟。
English: FlashInfer is a customizable and efficient attention engine for LLM serving that optimizes memory access and performance through block-sparse KV-cache storage, JIT compilation, and load-balanced scheduling, achieving significant latency reductions across various inference scenarios.

Authors:Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
Title: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Abstract:
Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally like humans. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
中文摘要:本文提出了一种从教学视频中提取的高质量多模态教材语料库,为视觉语言模型提供了更丰富的知识基础和更好的图文对齐,显著提升了其在知识密集型和推理任务中的表现。
English Summary: This paper introduces a high-quality multimodal textbook corpus derived from instructional videos, which provides richer foundational knowledge and better image-text alignment for Vision-Language Models, significantly enhancing their performance in knowledge-intensive and reasoning tasks.

Authors:David Wu, Sanjiban Choudhury
Title: Aligning LLMs with Domain Invariant Reward Models
Abstract:
Aligning large language models (LLMs) to human preferences is challenging in domains where preference data is unavailable. We address the problem of learning reward models for such target domains by leveraging feedback collected from simpler source domains, where human preferences are easier to obtain. Our key insight is that, while domains may differ significantly, human preferences convey domain-agnostic concepts that can be effectively captured by a reward model. We propose DIAL, a framework that trains domain-invariant reward models by optimizing a dual loss: a domain loss that minimizes the divergence between source and target distribution, and a source loss that optimizes preferences on the source domain. We show DIAL is a general approach that we evaluate and analyze across 4 distinct settings: (1) Cross-lingual transfer (accuracy: $0.621 \rightarrow 0.661$), (2) Clean-to-noisy (accuracy: $0.671 \rightarrow 0.703$), (3) Few-shot-to-full transfer (accuracy: $0.845 \rightarrow 0.920$), and (4) Simple-to-complex tasks transfer (correlation: $0.508 \rightarrow 0.556$). Our code, models and data are available at https://github.com/portal-cornell/dial.
Chinese: 本研究提出了一种训练领域不变奖励模型的框架,通过利用从较简单领域收集的反馈来在缺乏人类偏好数据的目标领域中调整大语言模型,并在多种设置下实现了性能提升。
English: This study introduces a framework for training domain-invariant reward models by using feedback from simpler domains to align LLMs in target domains where human preference data is unavailable, achieving improved performance across diverse settings.

Authors:Libin Lan, Lu Jiang, Tianshu Yu, Xiaojuan Liu, Zhongshi He
Title: FullTransNet: Full Transformer with Local-Global Attention for Video Summarization
Abstract:
Video summarization aims to generate a compact, informative, and representative synopsis of raw videos, which is crucial for browsing, analyzing, and understanding video content. Dominant approaches in video summarization primarily rely on recurrent or convolutional neural networks, and more recently on encoder-only transformer architectures. However, these methods typically suffer from several limitations in parallelism, modeling long-range dependencies, and providing explicit generative capabilities. To address these issues, we propose a transformer-like architecture named FullTransNet, built on two ideas. First, it uses a full transformer with an encoder-decoder structure as an alternative architecture for video summarization. As the full transformer is specifically designed for sequence transduction tasks, its direct application to video summarization is both intuitive and effective. Second, it replaces the standard full attention mechanism with a combination of local and global sparse attention, enabling the model to capture long-range dependencies while significantly reducing computational costs. This local-global sparse attention is applied exclusively at the encoder side, where the majority of computations occur, further enhancing efficiency. Extensive experiments on two widely used benchmark datasets, SumMe and TVSum, demonstrate that our model achieves F-scores of 54.4% and 63.9%, respectively, while maintaining relatively low computational and memory requirements. These results surpass the second-best performing methods by 0.1% and 0.3%, respectively, verifying the effectiveness and efficiency of FullTransNet.
中文: FullTransNet 提出了一种基于Transformer的编码器-解码器架构,结合局部-全局稀疏注意力机制,在降低计算成本的同时,于标准数据集上实现了最优的视频摘要生成性能。
English: FullTransNet introduces a transformer-based encoder-decoder architecture with local-global sparse attention to efficiently generate video summaries, achieving state-of-the-art performance on benchmark datasets with reduced computational costs.

Authors:Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, Lizhuang Ma
Title: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
Abstract:
Employing LLMs for visual generation has recently become a research focus. However, the existing methods primarily transfer the LLM architecture to visual generation but rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal utilization of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of visual embedding space under the LLM framework and discover that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. Firstly, we propose a Codebook Rearrangement strategy that uses balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located. This approach ensures that even if the model predicts the wrong token index, there is a high probability the predicted token is located in the correct cluster, which significantly enhances the generation quality and robustness. Extensive experiments demonstrate that our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation. The code is available at: https://github.com/sjtuplayer/IAR.
中文: 本文提出IAR改进自回归视觉生成方法,通过重组视觉码本并采用面向簇的交叉熵损失,有效提升了基于大语言模型的视觉生成效率与质量。
English: This paper introduces IAR, an improved autoregressive visual generation method that enhances LLM-based visual generation by rearranging the codebook and using cluster-oriented loss to boost training efficiency and output quality.
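The idea of a cluster-oriented loss described in the abstract above can be sketched as follows. The exact loss used by IAR is not given in this summary, so the form below is an assumption made for illustration: the cluster probability is taken as the sum of token probabilities inside the target's cluster, and it is added to the usual token-level cross-entropy with a hypothetical `weight`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cluster_oriented_loss(logits, target, cluster_of, weight=1.0):
    """Return (total, token_loss, cluster_loss) for one prediction.

    cluster_of[i] is the cluster id of codebook token i (assumed to come from
    a prior clustering of the codebook, e.g. balanced k-means)."""
    probs = softmax(logits)
    token_loss = -math.log(probs[target])
    target_cluster = cluster_of[target]
    # Probability mass the model places anywhere in the correct cluster.
    cluster_prob = sum(p for i, p in enumerate(probs)
                       if cluster_of[i] == target_cluster)
    cluster_loss = -math.log(cluster_prob)
    return token_loss + weight * cluster_loss, token_loss, cluster_loss

clusters = [0, 0, 1, 1]          # tokens 0,1 share a cluster; tokens 2,3 share another
near_miss = [0.0, 5.0, 0.0, 0.0]  # wrong token, but in the target's cluster
far_miss = [0.0, 0.0, 5.0, 0.0]   # wrong token in the wrong cluster
loss_near = cluster_oriented_loss(near_miss, target=0, cluster_of=clusters)
loss_far = cluster_oriented_loss(far_miss, target=0, cluster_of=clusters)
```

Both predictions are equally wrong at the token level, but the near miss incurs a much smaller cluster term. This matches the behaviour the abstract describes: a wrong token index is penalised less when it lands in the correct cluster.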

Authors:Mingjia Li, Shuang Li, Tongrui Su, Longhui Yuan, Jian Liang, Wei Li
Title: Exploring Structured Semantic Priors Underlying Diffusion Score for Test-time Adaptation
Abstract:
Capitalizing on the complementary advantages of generative and discriminative models has always been a compelling vision in machine learning, backed by a growing body of research. This work discloses the hidden semantic structure within score-based generative models, unveiling their potential as effective discriminative priors. Inspired by our theoretical findings, we propose DUSA to exploit the structured semantic priors underlying diffusion score to facilitate the test-time adaptation of image classifiers or dense predictors. Notably, DUSA extracts knowledge from a single timestep of denoising diffusion, lifting the curse of Monte Carlo-based likelihood estimation over timesteps. We demonstrate the efficacy of our DUSA in adapting a wide variety of competitive pre-trained discriminative models on diverse test-time scenarios. Additionally, a thorough ablation study is conducted to dissect the pivotal elements in DUSA. Code is publicly available at https://github.com/BIT-DA/DUSA.
Chinese Summary: 本研究提出DUSA方法,利用基于分数的生成模型中的语义先验,仅通过单个扩散步骤即可实现判别模型的高效测试时适应。
English Summary: This research introduces DUSA, a method that leverages the semantic priors in score-based generative models to enable efficient test-time adaptation of discriminative models using just a single diffusion timestep.

Authors:Nicholas Magal, Minh Tran, Riku Arakawa, Suzanne Nie
Title: Negative to Positive Co-learning with Aggressive Modality Dropout
Abstract:
This paper aims to document an effective way to improve multimodal co-learning by using aggressive modality dropout. We find that by using aggressive modality dropout we are able to reverse negative co-learning (NCL) to positive co-learning (PCL). Aggressive modality dropout can be used to "prep" a multimodal model for unimodal deployment, and dramatically increases model performance during negative co-learning, where during some experiments we saw a 20% gain in accuracy. We also benchmark our modality dropout technique against PCL to show that it improves co-learning during PCL, although it does not have as substantial an effect as it does during NCL. GitHub: https://github.com/nmagal/modality_drop_for_colearning
中文摘要:本文通过采用激进的模态丢弃方法,成功将负协同学习转化为正协同学习,使模型准确率最高提升20%,并为单模态部署做好了准备。
English Summary: This paper demonstrates that aggressive modality dropout effectively reverses negative co-learning into positive co-learning, significantly boosting model accuracy by up to 20% and preparing models for unimodal deployment.
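Modality dropout in general can be sketched in a few lines. This is a generic illustration rather than the authors' code: the function name `modality_dropout`, the dictionary layout, the `keep` modality, and the choice to zero out a dropped modality's features are assumptions. "Aggressive" is read here simply as a high `drop_prob`.

```python
import random

def modality_dropout(features, drop_prob, keep="text", rng=random):
    """Zero out each modality other than `keep` with probability drop_prob.

    features maps a modality name to its feature vector (a list of floats).
    The `keep` modality is the one intended for unimodal deployment."""
    dropped = {}
    for name, vector in features.items():
        if name != keep and rng.random() < drop_prob:
            dropped[name] = [0.0] * len(vector)  # modality removed for this sample
        else:
            dropped[name] = list(vector)         # modality passed through unchanged
    return dropped

sample = {"text": [0.2, 0.7], "audio": [1.0, -1.0], "video": [0.5, 0.5]}
always = modality_dropout(sample, drop_prob=1.0)  # every auxiliary modality dropped
never = modality_dropout(sample, drop_prob=0.0)   # nothing dropped
```

Applied per training sample, a high `drop_prob` forces the model to rely mostly on the kept modality while still occasionally seeing the others, which is one way to prepare it for unimodal deployment.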

Authors:Van Quang Nguyen, Quoc Chuong Nguyen, Thu Huong Dang, Truong-Son Hy
Title: Hybridising Reinforcement Learning and Heuristics for Hierarchical Directed Arc Routing Problems
Abstract:
The Hierarchical Directed Capacitated Arc Routing Problem (HDCARP) is an extension of the Capacitated Arc Routing Problem (CARP), where the arcs of a graph are divided into classes based on their priority. The traversal of these classes is determined by either precedence constraints or a hierarchical objective, resulting in two distinct HDCARP variants. To the best of our knowledge, only one matheuristic has been proposed for these variants, but it performs relatively slowly, particularly for large-scale instances (Ha et al., 2024). In this paper, we propose a fast heuristic to efficiently address the computational challenges of HDCARP. Furthermore, we incorporate Reinforcement Learning (RL) into our heuristic to effectively guide the selection of local search operators, resulting in a hybrid algorithm. We name this hybrid algorithm the Hybrid Reinforcement Learning and Heuristic Algorithm for Directed Arc Routing (HRDA). The hybrid algorithm adapts to changes in the problem dynamically, using real-time feedback to improve routing strategies and solution quality by integrating heuristic methods. Extensive computational experiments on artificial instances demonstrate that this hybrid approach significantly improves the speed of the heuristic without deteriorating the solution quality. Our source code is publicly available at: https://github.com/HySonLab/ArcRoute
Chinese: 本文提出HRDA混合算法,将强化学习与启发式方法相结合,有效解决分层有向容量弧路径问题,在保持解质量的同时显著提升了计算速度。
English: This paper introduces HRDA, a hybrid algorithm combining reinforcement learning with heuristics to efficiently solve the Hierarchical Directed Capacitated Arc Routing Problem, significantly improving computational speed while maintaining solution quality.

Authors:Yulong Ye, Tao Chen, Miqing Li
Title: Distilled Lifelong Self-Adaptation for Configurable Systems
Abstract:
Modern configurable systems provide tremendous opportunities for engineering future intelligent software systems. A key difficulty thereof is how to effectively self-adapt the configuration of a running system such that its performance (e.g., runtime and throughput) can be optimized under time-varying workloads. This unfortunately remains unaddressed in existing approaches as they either overlook the available past knowledge or rely on static exploitation of past knowledge without reasoning about the usefulness of information when planning for self-adaptation. In this paper, we tackle this challenging problem by proposing DLiSA, a framework that self-adapts configurable systems. DLiSA comes with two properties: firstly, it supports lifelong planning, and thereby the planning process runs continuously throughout the lifetime of the system, allowing dynamic exploitation of the accumulated knowledge for rapid adaptation. Secondly, the planning for a newly emerged workload is boosted via distilled knowledge seeding, in which the knowledge is dynamically purified such that only useful past configurations are seeded when necessary, mitigating misleading information. Extensive experiments suggest that the proposed DLiSA significantly outperforms state-of-the-art approaches, demonstrating a performance improvement of up to 229% and a resource acceleration of up to 2.22x on generating promising adaptation configurations. All data and sources can be found at our repository: https://github.com/ideas-labo/dlisa.
Chinese: DLiSA框架通过支持终身规划和动态纯化历史知识来引导有效的自我适应,解决了在变化工作负载下优化可配置系统性能的难题,其表现显著优于现有方法。
English: The DLiSA framework addresses the challenge of optimizing configurable systems' performance under varying workloads by enabling lifelong planning and dynamically purifying past knowledge to guide effective self-adaptation, significantly outperforming existing methods.

Authors:Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang
Title: Multimodal Large Models Are Effective Action Anticipators
Abstract:
The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation. Code is available at https://github.com/2tianyao1/ActionLLM.git.
Chinese: ActionLLM框架利用大型语言模型将视频序列视为连续标记,通过动作调整模块和跨模态交互块增强多模态理解,在基准数据集上展现出卓越的长期动作预测性能。
English: The ActionLLM framework leverages large language models to treat video sequences as tokens for long-term action anticipation, incorporating an action tuning module and cross-modality interaction to enhance multimodal understanding and achieve superior performance on benchmarks.

Authors:Suho Park, SuBeen Lee, Hyun Seok Seong, Jaejoon Yoo, Jae-Pil Heo
Title: Foreground-Covering Prototype Generation and Matching for SAM-Aided Few-Shot Segmentation
Abstract:
We propose Foreground-Covering Prototype Generation and Matching to resolve Few-Shot Segmentation (FSS), which aims to segment target regions in unlabeled query images based on labeled support images. Unlike previous research, which typically estimates target regions in the query using support prototypes and query pixels, we utilize the relationship between support and query prototypes. To achieve this, we utilize two complementary features: SAM Image Encoder features for pixel aggregation and ResNet features for class consistency. Specifically, we construct support and query prototypes with SAM features and distinguish query prototypes of target regions based on ResNet features. For the query prototype construction, we begin by roughly guiding foreground regions within SAM features using the conventional pseudo-mask, then employ iterative cross-attention to aggregate foreground features into learnable tokens. Here, we discover that the cross-attention weights can effectively replace the conventional pseudo-mask. Therefore, we use the attention-based pseudo-mask to guide ResNet features to focus on the foreground, then infuse the guided ResNet feature into the learnable tokens to generate class-consistent query prototypes. The generation of the support prototype is conducted symmetrically to that of the query one, with the pseudo-mask replaced by the ground-truth mask. Finally, we compare these query prototypes with support ones to generate prompts, which subsequently produce object masks through the SAM Mask Decoder. Our state-of-the-art performances on various datasets validate the effectiveness of the proposed method for FSS. Our official code is available at https://github.com/SuhoPark0706/FCP
中文: 本研究提出前景覆盖原型生成与匹配方法,通过结合SAM和ResNet特征构建类别一致的原型,并利用交叉注意力优化伪掩码,在多个数据集上实现了最先进的少样本分割性能。
English: This study introduces Foreground-Covering Prototype Generation and Matching for Few-Shot Segmentation, leveraging SAM and ResNet features to create class-consistent prototypes and using cross-attention to refine pseudo-masks, achieving state-of-the-art results across multiple datasets.

Authors:Haoxuan Li, Wei Song, Peiwu Qin, Xi Yuan, Zhenglin Chen
Title: HCMA-UNet: A Hybrid CNN-Mamba UNet with Axial Self-Attention for Efficient Breast Cancer Segmentation
Abstract:
Breast cancer lesion segmentation in DCE-MRI remains challenging due to heterogeneous tumor morphology and indistinct boundaries. To address these challenges, this study proposes a novel hybrid segmentation network, HCMA-UNet, for lesion segmentation of breast cancer. Our network consists of a lightweight CNN backbone and a Multi-view Axial Self-Attention Mamba (MISM) module. The MISM module integrates Visual State Space Block (VSSB) and Axial Self-Attention (ASA) mechanism, effectively reducing parameters through Asymmetric Split Channel (ASC) strategy to achieve efficient tri-directional feature extraction. Our lightweight model achieves superior performance with 2.87M parameters and 126.44 GFLOPs. A Feature-guided Region-aware loss function (FRLoss) is proposed to enhance segmentation accuracy. Extensive experiments on one private and two public DCE-MRI breast cancer datasets demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. FRLoss also exhibits good cross-architecture generalization capabilities. The source code is available at https://github.com/Haoxuanli-Thu/HCMA-UNet.
Chinese: 本研究提出轻量级混合网络HCMA-UNet,通过新型MISM模块和FRLoss损失函数,在保持计算效率的同时实现了DCE-MRI乳腺癌病灶分割的最优性能。
English: This study introduces HCMA-UNet, a lightweight hybrid network with a novel MISM module and FRLoss function, achieving state-of-the-art breast cancer lesion segmentation in DCE-MRI while maintaining computational efficiency.

Authors:Yiwei Qin, Yixiu Liu, Pengfei Liu
Title: DIVE: Diversified Iterative Self-Improvement
Abstract:
Recent advances in large language models (LLMs) have demonstrated the effectiveness of Iterative Self-Improvement (ISI) techniques. However, continuous training on self-generated data leads to reduced output diversity, a limitation particularly critical in reasoning tasks where diverse solution paths are essential. We present DIVE (Diversified Iterative Self-Improvement), a novel framework that addresses this challenge through two key components: Sample Pool Expansion for broader solution exploration, and Data Selection for balancing diversity and quality in preference pairs. Experiments on MATH and GSM8k datasets show that DIVE achieves a 10% to 45% relative increase in output diversity metrics while maintaining performance quality compared to vanilla ISI. Our ablation studies confirm both components' significance in achieving these improvements. Code is available at https://github.com/qinyiwei/DIVE.
中文: DIVE框架通过扩展样本池和平衡多样性及质量的数据选择,改进了大型语言模型的迭代自我优化,在推理任务中显著提升了输出多样性,同时保持了性能水平。
English: The DIVE framework enhances iterative self-improvement in large language models by expanding sample pools and selecting data to balance diversity and quality, achieving significant gains in output diversity without compromising performance on reasoning tasks.

Authors:Mengran Li, Chaojun Ding, Junzhou Chen, Wenbin Xing, Cong Ye, Ronghui Zhang, Songlin Zhuang, Jia Hu, Tony Z. Qiu, Huijun Gao
Title: AttriReBoost: A Gradient-Free Propagation Optimization Method for Cold Start Mitigation in Attribute Missing Graphs
Abstract:
Missing attribute issues are prevalent in graph learning, leading to biased outcomes in Graph Neural Networks (GNNs). Existing methods that rely on feature propagation are prone to the cold start problem, particularly when dealing with attribute resetting and low-degree nodes, which hinder effective propagation and convergence. To address these challenges, we propose AttriReBoost (ARB), a novel propagation-based method that mitigates cold start problems in attribute-missing graphs. ARB enhances global feature propagation by redefining initial boundary conditions and strategically integrating virtual edges, thereby improving node connectivity and ensuring more stable and efficient convergence. This method facilitates gradient-free attribute reconstruction with lower computational overhead. The proposed method is theoretically grounded, with its convergence rigorously established. Extensive experiments on several real-world benchmark datasets demonstrate the effectiveness of ARB, achieving an average accuracy improvement of 5.11% over state-of-the-art methods. Additionally, ARB exhibits remarkable computational efficiency, processing a large-scale graph with 2.49 million nodes in just 16 seconds on a single GPU. Our code is available at https://github.com/limengran98/ARB.
中文: 提出的AttriReBoost方法通过重新定义边界条件和策略性添加虚拟边来增强全局特征传播,有效解决了属性缺失图中的冷启动问题,在显著提升精度的同时保持了优异的计算效率。
English: The proposed AttriReBoost method effectively addresses cold start problems in attribute-missing graphs by enhancing global feature propagation through redefined boundary conditions and virtual edges, achieving significant accuracy improvements and computational efficiency.

Authors:Ruibin Li, Tao Yang, Song Guo, Lei Zhang
Title: RORem: Training a Robust Object Remover with Human-in-the-Loop
Abstract:
Despite the significant advancements, existing object removal methods struggle with incomplete removal, incorrect content synthesis and blurry synthesized regions, resulting in low success rates. Such issues are mainly caused by the lack of high-quality paired training data, as well as the self-supervised training paradigm adopted in these methods, which forces the model to in-paint the masked regions, leading to ambiguity between synthesizing the masked objects and restoring the background. To address these issues, we propose a semi-supervised learning strategy with human-in-the-loop to create high-quality paired training data, aiming to train a Robust Object Remover (RORem). We first collect 60K training pairs from open-source datasets to train an initial object removal model for generating removal samples, and then utilize human feedback to select a set of high-quality object removal pairs, with which we train a discriminator to automate the following training data generation process. By iterating this process for several rounds, we finally obtain a substantial object removal dataset with over 200K pairs. Fine-tuning the pre-trained stable diffusion model with this dataset, we obtain our RORem, which demonstrates state-of-the-art object removal performance in terms of both reliability and image quality. Particularly, RORem improves the object removal success rate over previous methods by more than 18%. The dataset, source code and trained model are available at https://github.com/leeruibin/RORem.
English Summary: Existing object removal methods face challenges like incomplete removal and blurry synthesis due to limited training data and ambiguous self-supervised learning, prompting the development of RORem through a semi-supervised approach with human feedback that achieves over 18% higher success rates.

Authors:Chuanting Zhang, Haixia Zhang, Shuping Dang, Basem Shihada, Mohamed-Slim Alouini
Title: Gradient Compression and Correlation Driven Federated Learning for Wireless Traffic Prediction
Abstract:
Wireless traffic prediction plays an indispensable role in cellular networks to achieve proactive adaptation for communication systems. Along this line, Federated Learning (FL)-based wireless traffic prediction at the edge attracts enormous attention because of the exemption from raw data transmission and enhanced privacy protection. However, FL-based wireless traffic prediction methods still rely on heavy data transmissions between local clients and the server for local model updates. Besides, how to model the spatial dependencies of local clients under the framework of FL remains uncertain. To tackle this, we propose an innovative FL algorithm that employs gradient compression and correlation-driven techniques, effectively minimizing data transmission load while preserving prediction accuracy. Our approach begins with the introduction of gradient sparsification in wireless traffic prediction, allowing for significant data compression during model training. We then implement error feedback and gradient tracking methods to mitigate any performance degradation resulting from this compression. Moreover, we develop three tailored model aggregation strategies anchored in gradient correlation, enabling the capture of spatial dependencies across diverse clients. Experiments have been done with two real-world datasets and the results demonstrate that by capturing the spatio-temporal characteristics and correlation among local clients, the proposed algorithm outperforms the state-of-the-art algorithms and can increase the communication efficiency by up to two orders of magnitude without losing prediction accuracy. Code is available at https://github.com/chuanting/FedGCC.
中文: 该联邦学习算法采用梯度压缩和相关性驱动的聚合策略,在保持预测精度的同时大幅降低数据传输量,通过捕捉客户端间的时空依赖关系超越了现有方法。
English: The proposed federated learning algorithm utilizes gradient compression and correlation-driven aggregation to significantly reduce data transmission while maintaining prediction accuracy, outperforming existing methods by capturing spatio-temporal dependencies across clients.
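
The gradient sparsification and error feedback mentioned in the abstract above can be illustrated with a generic top-k scheme. This is a minimal sketch under stated assumptions, not the FedGCC implementation: the function name `topk_sparsify_with_feedback`, the top-k selection rule, and the plain-list gradient representation are all hypothetical choices made for illustration, and gradient tracking is not shown.

```python
def topk_sparsify_with_feedback(grad, residual, k):
    """Keep the k largest-magnitude entries; carry the rest forward.

    grad     -- current local gradient (list of floats)
    residual -- entries dropped in earlier rounds (same length as grad)
    k        -- number of entries the client is allowed to transmit

    Returns (sparse_update, new_residual). Generic top-k with error
    feedback; the selection rule is an assumption, not FedGCC's.
    """
    # Error feedback: add back whatever was withheld previously.
    corrected = [g + r for g, r in zip(grad, residual)]
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(corrected)),
                   key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(order[:k])
    # Transmitted update: zero everywhere except the kept entries.
    sparse_update = [v if i in keep else 0.0
                     for i, v in enumerate(corrected)]
    # Withheld entries become the residual for the next round.
    new_residual = [0.0 if i in keep else v
                    for i, v in enumerate(corrected)]
    return sparse_update, new_residual


# Two rounds with k = 1: entries dropped in round one still count later.
residual = [0.0, 0.0, 0.0, 0.0]
update1, residual = topk_sparsify_with_feedback(
    [0.5, -2.0, 0.25, 1.0], residual, k=1)
update2, residual = topk_sparsify_with_feedback(
    [0.75, 0.0, 0.0, 0.125], residual, k=1)
```

In the second round the first coordinate is transmitted because its withheld value from round one is added back before selection, which is the effect error feedback is meant to provide.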

Authors:Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
Title: Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding
Abstract:
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivAriant Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
中文: 提出的TAPE框架通过在各层融入序列内容,引入动态、上下文感知的位置编码,增强了Transformer的长上下文建模和推理能力,同时利用等变性确保稳定性,从而提升整体性能。
English: The proposed TAPE framework introduces dynamic, context-aware positional encodings that enhance transformer performance by incorporating sequence content across layers, improving long-context modeling and reasoning abilities while ensuring stability through equivariance properties.

Authors:Chethan Bhateja, Joseph O'Brien, Afnaan Hashmi, Eva Prakash
Title: Cost and Reward Infused Metric Elicitation
Abstract:
In machine learning, metric elicitation refers to the selection of performance metrics that best reflect an individual's implicit preferences for a given application. Currently, metric elicitation methods only consider metrics that depend on the accuracy values encoded within a given model's confusion matrix. However, focusing solely on confusion matrices does not account for other model feasibility considerations such as varied monetary costs or latencies. In our work, we build upon the multiclass metric elicitation framework of Hiranandani et al., extrapolating their proposed Diagonal Linear Performance Metric Elicitation (DLPME) algorithm to account for additional bounded costs and rewards. Our experimental results with synthetic data demonstrate our approach's ability to quickly converge to the true metric.
中文摘要:本研究在多分类度量诱导框架基础上,引入超出混淆矩阵准确率的额外有界成本和奖励因素,通过合成数据实验证明所提方法能有效收敛至真实性能度量。
English Summary: This study extends the multiclass metric elicitation framework by incorporating additional bounded costs and rewards beyond confusion matrix accuracy, demonstrating through synthetic data experiments that the proposed approach efficiently converges to the true performance metric.

Authors:Md Rakibul Hasan, Yue Yao, Md Zakir Hossain, Aneesh Krishna, Imre Rudas, Shafin Rahman, Tom Gedeon
Title: Labels Generated by Large Language Models Help Measure People's Empathy in Vitro
Abstract:
Large language models (LLMs) have revolutionised many fields, with LLM-as-a-service (LLMSaaS) offering accessible, general-purpose solutions without costly task-specific training. In contrast to the widely studied prompt engineering for directly solving tasks (in vivo), this paper explores LLMs' potential for in-vitro applications: using LLM-generated labels to improve supervised training of mainstream models. We examine two strategies - (1) noisy label correction and (2) training data augmentation - in empathy computing, an emerging task to predict psychology-based questionnaire outcomes from inputs like textual narratives. Crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. We show that replacing or supplementing these crowdsourced labels with LLM-generated labels, developed using psychology-based scale-aware prompts, achieves statistically significant accuracy improvements. Notably, the RoBERTa pre-trained language model (PLM) trained with noise-reduced labels yields a state-of-the-art Pearson correlation coefficient of 0.648 on the public NewsEmp benchmarks. This paper further analyses evaluation metric selection and demographic biases to help guide the future development of more equitable empathy computing models. Code and LLM-generated labels are available at https://github.com/hasan-rakibul/LLMPathy.
中文: 本研究证明,利用大语言模型生成的标签进行噪声标签校正和数据增强,显著提升了共情计算中监督模型的准确性,在基准数据集上达到了最先进的性能水平。
English: This study demonstrates that using large language model-generated labels for noisy label correction and data augmentation significantly improves the accuracy of supervised models in empathy computing, achieving state-of-the-art performance on benchmark datasets.

Authors:Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li
Title: Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
Abstract:
Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective in capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then discovered that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, e.g., token representations becoming increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, simultaneously addressing recency bias and over-smoothing. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and unlocks SSMs to benefit further from deeper architectures. All source codes are released at https://github.com/VITA-Group/SSM-Bottleneck.
中文摘要:结构化状态空间模型(SSM)存在固有的近期偏好和深度引发的过度平滑问题,但通过极化状态转移矩阵的新方法可有效缓解这些限制,同时提升长距离关联回忆能力。
English Summary: Structured State Space Models (SSMs) suffer from inherent recency bias and depth-induced over-smoothing, but a novel polarization technique for state transition matrices effectively mitigates both limitations while enhancing long-range associative recall.
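
The recency effect and the two polarized settings described in the abstract above can be seen in a toy scalar recurrence. This is only a sketch under stated assumptions: the recurrence `h_t = a * h_(t-1) + x_t` with a fixed scalar `a` is a hypothetical simplification chosen for illustration, and it is not the authors' model, whose state transitions are richer.

```python
def scan(a, xs):
    """Run the toy linear recurrence h_t = a * h_(t-1) + x_t.

    a  -- fixed scalar state transition (assumed constant here)
    xs -- input sequence (list of floats)

    Returns the final state. An input that arrived d steps before the
    end contributes x * a**d to it.
    """
    h = 0.0
    for x in xs:
        h = a * h + x
    return h


# A single early input followed by silence.
early_signal = [1.0, 0.0, 0.0, 0.0]

decayed = scan(0.5, early_signal)    # 0 < a < 1: early input fades
remembered = scan(1.0, early_signal) # a = 1: early input is retained
forgotten = scan(0.0, early_signal)  # a = 0: only the latest input counts
```

With `a` strictly between 0 and 1 the contribution of an input shrinks geometrically with its distance from the end, which is the toy analogue of recency bias. The two extreme settings behave like a channel that keeps everything and a channel that keeps only the current token.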

Authors:Abdesselam Ferdi
Title: Lightweight G-YOLOv11: Advancing Efficient Fracture Detection in Pediatric Wrist X-rays
Abstract:
Computer-aided diagnosis (CAD) systems have greatly improved the interpretation of medical images by radiologists and surgeons. However, current CAD systems for fracture detection in X-ray images primarily rely on large, resource-intensive detectors, which limits their practicality in clinical settings. To address this limitation, we propose a novel lightweight CAD system based on the YOLO detector for fracture detection. This system, named ghost convolution-based YOLOv11 (G-YOLOv11), builds on the latest version of the YOLO detector family and incorporates the ghost convolution operation for feature extraction. The ghost convolution operation generates the same number of feature maps as traditional convolution but requires fewer linear operations, thereby reducing the detector's computational resource requirements. We evaluated the performance of the proposed G-YOLOv11 detector on the GRAZPEDWRI-DX dataset, achieving an mAP@0.5 of 0.535 with an inference time of 2.4 ms on an NVIDIA A10 GPU. Compared to the standard YOLOv11l, G-YOLOv11l achieved reductions of 13.6% in mAP@0.5 and 68.7% in size. These results establish a new state-of-the-art benchmark in terms of efficiency, outperforming existing detectors. Code and models are available at https://github.com/AbdesselamFerdi/G-YOLOv11.
中文: 研究人员开发了一种名为G-YOLOv11的轻量级计算机辅助诊断系统,通过采用幽灵卷积技术实现X光图像骨折检测,在保持较高检测效率的同时将模型体积减小68.7%,创下了计算效率的新标杆。
English: Researchers have developed a lightweight computer-aided diagnosis system called G-YOLOv11 that uses ghost convolution to detect fractures in X-ray images with high efficiency, achieving state-of-the-art performance in computational speed while reducing model size by 68.7%.
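
The saving that ghost convolution offers over a standard convolution can be made concrete with a weight count. This is a minimal sketch that follows the original GhostNet formulation (a primary convolution producing a fraction of the output maps, then cheap depthwise operations producing the rest); the `ratio` and `cheap_k` values are assumptions for illustration, since the abstract does not state the settings used in G-YOLOv11.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_conv_params(c_in, c_out, k, ratio=2, cheap_k=3):
    """Weights in a GhostNet-style ghost convolution (bias ignored).

    ratio   -- output maps per intrinsic map (assumed value)
    cheap_k -- kernel size of the cheap depthwise ops (assumed value)
    """
    if c_out % ratio != 0:
        raise ValueError("c_out must be divisible by ratio")
    intrinsic = c_out // ratio
    # Primary convolution produces only the intrinsic feature maps.
    primary = c_in * intrinsic * k * k
    # Each intrinsic map yields (ratio - 1) extra maps through a
    # depthwise cheap_k x cheap_k filter.
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap


standard = conv_params(64, 128, 3)
ghost = ghost_conv_params(64, 128, 3)
saving = standard / ghost  # close to `ratio` when c_in is large
```

Both layers output 128 feature maps, but under these assumed settings the ghost variant needs roughly half the weights.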

Authors:Yuchuan Tian, Jing Han, Chengcheng Wang, Yuchen Liang, Chao Xu, Hanting Chen
Title: DiC: Rethinking Conv3x3 Designs in Diffusion Models
Abstract:
Diffusion models have shown exceptional performance in visual generation tasks. Recently, these models have shifted from traditional U-Shaped CNN-Attention hybrid structures to fully transformer-based isotropic architectures. While these transformers exhibit strong scalability and performance, their reliance on complicated self-attention operations results in slow inference speeds. Contrary to these works, we rethink one of the simplest yet fastest modules in deep learning, 3x3 Convolution, to construct a scaled-up purely convolutional diffusion model. We first discover that an Encoder-Decoder Hourglass design outperforms scalable isotropic architectures for Conv3x3, but still underperforms our expectations. Further improving the architecture, we introduce sparse skip connections to reduce redundancy and improve scalability. Based on the architecture, we introduce conditioning improvements including stage-specific embeddings, mid-block condition injection, and conditional gating. These improvements lead to our proposed Diffusion CNN (DiC), which serves as a swift yet competitive diffusion architecture baseline. Experiments on various scales and settings show that DiC surpasses existing diffusion transformers by considerable margins in terms of performance while keeping a good speed advantage. Project page: https://github.com/YuchuanTian/DiC
中文摘要:提出的扩散卷积网络(DiC)通过采用优化的卷积结构、稀疏连接和改进的条件机制,在保持速度优势的同时,其性能显著超越了现有的扩散变换器模型。
English Summary: The proposed Diffusion CNN (DiC) replaces complex transformer architectures with optimized convolutional networks featuring sparse skip connections and enhanced conditioning, achieving better performance and faster inference than existing diffusion transformers.

Authors:Daniel Sanchez, David Alfaya, Jaime Pizarroso
Title: Motives meet SymPy: studying $λ$-ring expressions in Python
Abstract:
We present a new Python package called "motives", a symbolic manipulation package based on SymPy capable of handling and simplifying motivic expressions in the Grothendieck ring of Chow motives and other types of $λ$-rings. The package is able to manipulate and compare arbitrary expressions in $λ$-rings and, in particular, it contains explicit tools for manipulating motives of several types of commonly used moduli schemes and moduli stacks of decorated bundles on curves. We have applied this new tool to advance in the verification of Mozgovoy's conjectural formula for the motive of the moduli space of twisted Higgs bundles, proving that it holds in rank 2 and 3 for any curve of genus up to 18 and any twisting bundle of small degree.
中文: "motives" Python 符号计算包能够处理格罗滕迪克环中的母题表达式,并已验证莫兹戈沃伊关于扭 Higgs 丛模空间猜想在秩2和3情形下的正确性。
English: The "motives" Python package enables symbolic manipulation and simplification of motivic expressions in Grothendieck rings, facilitating verification of Mozgovoy's conjecture for twisted Higgs bundles in ranks 2 and 3.

Authors:Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, Surangika Ranathunga
Title: Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches
Abstract:
Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is eminently prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: Our baseline is a rule-based method, which is then compared against our second method where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based Encoder-Decoder solution. We observed that the Transformer-based method could capture many ad-hoc patterns within the Romanized scripts compared to the rule-based method. The code base associated with this paper is available on GitHub - https://github.com/kasunw22/Sinhala-Transliterator/
中文: 本研究针对僧伽罗语罗马化转写的普遍问题,比较了基于规则的方法和基于Transformer的序列到序列方法,发现后者能更有效地捕捉转写中的临时模式。
English: The study addresses the prevalence of Romanized Sinhala transliteration by comparing a rule-based method with a Transformer-based sequence-to-sequence approach, finding the latter more effective at capturing ad-hoc patterns in the scripts.

Authors:Madeleine Darbyshire, Elizabeth Sklar, Simon Parsons
Title: Exploiting Boundary Loss for the Hierarchical Panoptic Segmentation of Plants and Leaves
Abstract:
Precision agriculture leverages data and machine learning so that farmers can monitor their crops and target interventions precisely. This enables the precision application of herbicide only to weeds, or the precision application of fertilizer only to undernourished crops, rather than to the entire field. The approach promises to maximize yields while minimizing resource use and harm to the surrounding environment. To this end, we propose a hierarchical panoptic segmentation method that simultaneously determines leaf count (as an identifier of plant growth) and locates weeds within an image. In particular, our approach aims to improve the segmentation of smaller instances like the leaves and weeds by incorporating focal loss and boundary loss. Not only does this result in competitive performance, achieving a PQ+ of 81.89 on the standard training set, but we also demonstrate we can improve leaf-counting accuracy with our method. The code is available at https://github.com/madeleinedarbyshire/HierarchicalMask2Former.
Chinese: 精准农业利用机器学习实现针对性作物监测与干预,通过结合焦点损失和边界损失的层次全景分割方法,提升了叶片计数和杂草检测的准确性,取得了优异性能。
English: Precision agriculture employs machine learning for targeted crop monitoring and intervention, using a hierarchical panoptic segmentation method with focal and boundary loss to enhance leaf counting and weed detection, achieving high performance and improved accuracy.
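
Of the two losses named in the abstract above, focal loss has a compact standard form that can be sketched directly. This is a minimal illustration of the widely used per-example definition, `-(1 - p_t)^gamma * log(p_t)`, and is not taken from the authors' code; the default `gamma=2.0` is an assumed value, and boundary loss is not shown.

```python
import math


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example.

    p_true -- predicted probability of the correct class, in (0, 1]
    gamma  -- focusing parameter (assumed default); gamma = 0 gives
              ordinary cross entropy
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    # (1 - p_true)^gamma shrinks the loss of well-classified examples.
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


easy = focal_loss(0.9)                 # confident, correct prediction
hard = focal_loss(0.1)                 # poorly classified example
easy_ce = focal_loss(0.9, gamma=0.0)   # plain cross entropy
```

The modulating factor reduces the weight of easy examples far more than that of hard ones, which is one reason focal loss is commonly used when small or rare instances would otherwise be swamped.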

Authors:Ke Yang, Volodymyr Kindratenko, ChengXiang Zhai
Title: TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment
Abstract:
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.
中文:简化语言环境通过降低数据集复杂性同时保留核心语言特征,提升了语言模型的训练效率,使得小型模型能超越传统训练方法的表现,并为资源优化的性能分析提供了可能。
English: Simplified language environments enhance LM training efficiency by reducing dataset complexity while preserving essential linguistic features, enabling smaller models to outperform traditional training methods and facilitating resource-optimized performance analysis.

Authors:Fangchen Yu, Ruilizhen Hu, Yidong Lin, Yuqi Ma, Zhenghao Huang, Wenye Li
Title: KAE: Kolmogorov-Arnold Auto-Encoder for Representation Learning
Abstract:
The Kolmogorov-Arnold Network (KAN) has recently gained attention as an alternative to traditional multi-layer perceptrons (MLPs), offering improved accuracy and interpretability by employing learnable activation functions on edges. In this paper, we introduce the Kolmogorov-Arnold Auto-Encoder (KAE), which integrates KAN with autoencoders (AEs) to enhance representation learning for retrieval, classification, and denoising tasks. Leveraging the flexible polynomial functions in KAN layers, KAE captures complex data patterns and non-linear relationships. Experiments on benchmark datasets demonstrate that KAE improves latent representation quality, reduces reconstruction errors, and achieves superior performance in downstream tasks such as retrieval, classification, and denoising, compared to standard autoencoders and other KAN variants. These results suggest KAE's potential as a useful tool for representation learning. Our code is available at https://github.com/SciYu/KAE/.
Chinese: Kolmogorov-Arnold自编码器(KAE)将KAN的可学习激活函数与自编码器结合,提升了表示学习能力,在检索、分类和去噪任务中表现优于传统方法。
English: The Kolmogorov-Arnold Auto-Encoder (KAE) combines KAN's learnable activation functions with autoencoders to enhance representation learning, achieving superior performance in retrieval, classification, and denoising tasks compared to standard methods.

Authors:Wenhao Dong, Yueyang Li, Weiming Zeng, Lei Chen, Hongjie Yan, Wai Ting Siok, Nizhuan Wang
Title: STARFormer: A Novel Spatio-Temporal Aggregation Reorganization Transformer of FMRI for Brain Disorder Diagnosis
Abstract:
Many existing methods that use functional magnetic resonance imaging (fMRI) to classify brain disorders, such as autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD), overlook the integration of spatial and temporal dependencies of the blood oxygen level-dependent (BOLD) signals, which may lead to inaccurate or imprecise classification results. To solve this problem, we propose a Spatio-Temporal Aggregation Reorganization Transformer (STARFormer) that effectively captures both spatial and temporal features of BOLD signals by incorporating three key modules. The region of interest (ROI) spatial structure analysis module uses eigenvector centrality (EC) to reorganize brain regions based on effective connectivity, highlighting critical spatial relationships relevant to the brain disorder. The temporal feature reorganization module systematically segments the time series into equal-dimensional window tokens and captures multiscale features through variable window and cross-window attention. The spatio-temporal feature fusion module employs a parallel transformer architecture with dedicated temporal and spatial branches to extract integrated features. The proposed STARFormer has been rigorously evaluated on two publicly available datasets for the classification of ASD and ADHD. The experimental results confirm that the STARFormer achieves state-of-the-art performance across multiple evaluation metrics, providing a more accurate and reliable tool for the diagnosis of brain disorders and biomedical research. The code is available at: https://github.com/NZWANG/STARFormer.
中文:提出的STARFormer模型通过专门模块有效整合fMRI BOLD信号的空间与时间特征,在自闭症和多动症分类任务中取得最优性能,同时解决了以往方法在捕捉这些依赖关系方面的不足。
English: The proposed STARFormer model effectively integrates spatial and temporal features of fMRI BOLD signals through specialized modules, achieving state-of-the-art performance in classifying autism and ADHD while addressing previous methods' limitations in capturing these dependencies.
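
The abstract above mentions eigenvector centrality (EC) as the score used to reorganize ROIs. As a rough illustration only, the sketch below computes EC by power iteration on a small non-negative connectivity matrix and sorts regions by that score. The function names, the shifted iteration, and the "sort by descending centrality" reading of "reorganize" are assumptions made for this example, not details taken from the paper.

```python
def eigenvector_centrality(adj, iters=200, tol=1e-10):
    """Power iteration on a non-negative square connectivity matrix.

    adj[i][j] is the (hypothetical) connection strength from ROI j to ROI i.
    Returns scores normalized to sum to 1.
    """
    n = len(adj)
    x = [1.0 / n] * n
    for _ in range(iters):
        # Iterate with (A + I): same eigenvectors as A, but the shift avoids
        # oscillation on bipartite-like graphs.
        y = [x[i] + sum(adj[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(y)
        y = [v / total for v in y]
        if max(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x


def reorder_rois(adj):
    """Return ROI indices sorted from most to least central (assumed ordering)."""
    scores = eigenvector_centrality(adj)
    return sorted(range(len(adj)), key=lambda i: -scores[i])


# Toy example: ROI 0 is connected to every other ROI (a star graph).
star = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
scores = eigenvector_centrality(star)
order = reorder_rois(star)
```

On this toy graph the hub ROI receives the largest score, so it comes first in `order`.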

Authors:Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang
Title: RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they cover only a limited range of RAG scenarios, and (2) they suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
中文摘要:RAG-Instruct是一种基于任意语料库、通过五种RAG范式和指令模拟技术合成多样化高质量指令数据的新方法,能显著提升大语言模型在多种任务中的检索增强生成性能。
English Summary: RAG-Instruct is a novel method that synthesizes diverse, high-quality instruction data from any corpus using five RAG paradigms and instruction simulation, effectively enhancing LLMs' retrieval-augmented generation capabilities across various tasks.

Authors:Runnan Chen, Zhaoqing Wang, Jiepeng Wang, Yuexin Ma, Mingming Gong, Wenping Wang, Tongliang Liu
Title: PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM
Abstract:
Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. This STL module addresses the challenges of label noise and inconsistencies in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from the RGB-D video. (https://github.com/runnanchen/PanoSLAM)
中文: PanoSLAM是首个统一SLAM系统,通过3D高斯泼溅和时空提升模块,将几何重建、3D语义分割和实例分割集成,实现了从RGB-D视频中精确的全景3D重建。
English: PanoSLAM is the first unified SLAM system that integrates geometric reconstruction, 3D semantic segmentation, and instance segmentation using 3D Gaussian Splatting and a Spatial-Temporal Lifting module to achieve accurate panoptic 3D reconstruction from RGB-D videos.

Authors:Runnan Chen, Xiangyu Sun, Zhaoqing Wang, Youquan Liu, Jiepeng Wang, Lingdong Kong, Jiankang Deng, Mingming Gong, Liang Pan, Wenping Wang, Tongliang Liu
Title: OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies
Abstract:
Open-vocabulary scene understanding using 3D Gaussian (3DGS) representations has garnered considerable attention. However, existing methods mostly lift knowledge from large 2D vision models into 3DGS on a scene-by-scene basis, which restricts open-vocabulary querying to their training scenes and leaves them unable to generalize to novel scenes. In this work, we propose OVGaussian, a generalizable Open-Vocabulary 3D semantic segmentation framework based on the 3D Gaussian representation. We first construct a large-scale 3D scene dataset based on 3DGS, dubbed SegGaussian, which provides detailed semantic and instance annotations for both Gaussian points and multi-view images. To promote semantic generalization across scenes, we introduce Generalizable Semantic Rasterization (GSR), which leverages a 3D neural network to learn and predict the semantic property for each 3D Gaussian point, where the semantic property can be rendered as multi-view consistent 2D semantic maps. Next, we propose a Cross-modal Consistency Learning (CCL) framework that utilizes open-vocabulary annotations of 2D images and 3D Gaussians within SegGaussian to train a 3D neural network capable of open-vocabulary semantic segmentation across Gaussian-based 3D scenes. Experimental results demonstrate that OVGaussian significantly outperforms baseline methods, exhibiting robust cross-scene, cross-domain, and novel-view generalization capabilities. Code and the SegGaussian dataset will be released. (https://github.com/runnanchen/OVGaussian).
中文: OVGaussian 是一种创新的开放式词汇三维语义分割框架,通过大规模标注数据集和跨模态学习,实现了跨场景和跨领域的强大泛化能力。
English: OVGaussian is a novel open-vocabulary 3D semantic segmentation framework that leverages a large-scale annotated dataset and cross-modal learning to achieve robust generalization across diverse scenes and domains.

Authors:Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, Jiliang Tang
Title: Retrieval-Augmented Generation with Graphs (GraphRAG)
Abstract:
Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools from external sources. Graph, by its intrinsic "nodes connected by edges" nature, encodes massive heterogeneous and relational information, making it a golden resource for RAG in tremendous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with Graph, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diverse-formatted and domain-specific relational knowledge, poses unique and significant challenges when designing GraphRAG for different domains. Given the broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently desired. Following this motivation, we present a comprehensive and up-to-date survey on GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components, including query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph-RAG/GraphRAG/.
Chinese: GraphRAG通过利用图结构数据增强检索生成能力,针对不同领域的独特挑战提出了系统性框架和定制化技术解决方案。
English: GraphRAG enhances retrieval-augmented generation by leveraging graph-structured data, addressing unique challenges across domains through a systematic framework and tailored techniques.

Authors:Shi-Feng Peng, Guolei Sun, Yong Li, Hongsong Wang, Guo-Sen Xie
Title: SAM-Aware Graph Prompt Reasoning Network for Cross-Domain Few-Shot Segmentation
Abstract:
The primary challenge of cross-domain few-shot segmentation (CD-FSS) is the domain disparity between the training and inference phases, which can exist in either the input data or the target classes. Previous models struggle to learn feature representations that generalize to various unknown domains from limited training domain samples. In contrast, the large-scale visual model SAM, pre-trained on tens of millions of images from various domains and classes, possesses excellent generalizability. In this work, we propose a SAM-aware graph prompt reasoning network (GPRN) that fully leverages SAM to guide CD-FSS feature representation learning and improve prediction accuracy. Specifically, we propose a SAM-aware prompt initialization module (SPI) to transform the masks generated by SAM into visual prompts enriched with high-level semantic information. Since SAM tends to divide an object into many sub-regions, this may lead to visual prompts representing the same semantic object having inconsistent or fragmented features. We further propose a graph prompt reasoning (GPR) module that constructs a graph among visual prompts to reason about their interrelationships and enable each visual prompt to aggregate information from similar prompts, thus achieving global semantic consistency. Subsequently, each visual prompt embeds its semantic information into the corresponding mask region to assist in feature representation learning. To refine the segmentation mask during testing, we also design a non-parameter adaptive point selection module (APS) to select representative point prompts from query predictions and feed them back to SAM to refine inaccurate segmentation results. Experiments on four standard CD-FSS datasets demonstrate that our method establishes new state-of-the-art results. Code: https://github.com/CVL-hub/GPRN.
中文: 提出的SAM感知图提示推理网络(GPRN)利用大规模视觉模型SAM的泛化能力,通过将SAM生成的掩码转化为语义提示并借助图推理确保全局一致性,有效解决了跨域少样本分割中的领域差异问题,在多个数据集上取得了最优性能。
English: The proposed SAM-aware graph prompt reasoning network (GPRN) leverages the generalizability of the large-scale visual model SAM to address domain disparity in cross-domain few-shot segmentation by converting SAM masks into semantic prompts and ensuring global consistency through graph reasoning, achieving state-of-the-art results on multiple datasets.

Authors:Rajat Talak, Charis Georgiou, Jingnan Shi, Luca Carlone
Title: Outlier-Robust Training of Machine Learning Models
Abstract:
Robust training of machine learning models in the presence of outliers has garnered attention across various domains. The use of robust losses is a popular approach and is known to mitigate the impact of outliers. We bring to light two literatures that have diverged in their ways of designing robust losses: one using M-estimation, which is popular in robotics and computer vision, and another using a risk-minimization framework, which is popular in deep learning. We first show that a simple modification of the Black-Rangarajan duality provides a unifying view. The modified duality brings out a definition of a robust loss kernel $\sigma$ that is satisfied by robust losses in both literatures. Secondly, using the modified duality, we propose an Adaptive Alternation Algorithm (AAA) for training machine learning models with outliers. The algorithm iteratively trains the model by using a weighted version of the non-robust loss, while updating the weights at each iteration. The algorithm is augmented with a novel parameter update rule by interpreting the weights as inlier probabilities, and obviates the need for complex parameter tuning. Thirdly, we investigate convergence of the adaptive alternation algorithm to outlier-free optima. Considering arbitrary outliers (i.e., with no distributional assumption on the outliers), we show that the use of robust loss kernels $\sigma$ increases the region of convergence. We experimentally show the efficacy of our algorithm on regression, classification, and neural scene reconstruction problems. We release our implementation code: https://github.com/MIT-SPARK/ORT.
中文摘要:本文通过改进的对偶框架提出了鲁棒损失设计的统一视角,并开发了一种自适应交替算法,通过将权重作为内点概率迭代更新来增强含异常值时的模型训练,在多种任务中提升了收敛性和性能。
English Summary: This paper introduces a unified view of robust loss design through a modified duality framework and proposes an Adaptive Alternation Algorithm that enhances model training with outliers by iteratively updating weights as inlier probabilities, improving convergence and performance across various tasks.
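
The alternation described above (fit with a weighted non-robust loss, then update the weights) has a classical textbook counterpart in iteratively reweighted least squares. The sketch below is that generic scheme for a 1-D line fit with Geman-McClure weights. It is not the paper's Adaptive Alternation Algorithm: it does not include the paper's parameter update rule, and the fixed scale `c` and all function names are assumptions chosen for illustration.

```python
def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least squares for y = a * x + b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def robust_line_fit(xs, ys, c=1.0, iters=50):
    """Alternate between a weighted fit and a per-sample weight update."""
    ws = [1.0] * len(xs)
    a, b = weighted_line_fit(xs, ys, ws)
    for _ in range(iters):
        # Weight update: Geman-McClure weight w = (c^2 / (c^2 + r^2))^2,
        # which is close to 1 for small residuals and close to 0 for large ones.
        ws = [(c * c / (c * c + (y - (a * x + b)) ** 2)) ** 2
              for x, y in zip(xs, ys)]
        # Model update: refit with the weighted squared (non-robust) loss.
        a, b = weighted_line_fit(xs, ys, ws)
    return a, b, ws


# Toy data: points on y = 2x + 1, with the last one replaced by a gross outlier.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[9] = 100.0
a, b, ws = robust_line_fit(xs, ys)
```

In this toy run the outlier's weight collapses toward zero while the inlier weights stay near one, so the recovered line is close to the outlier-free fit.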

Authors:Edwin Arkel Rios, Jansen Christopher Yuanda, Vincent Leon Ghanz, Cheng-Wei Yu, Bo-Cheng Lai, Min-Chun Hu
Title: Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition
Abstract:
Ultra-fine-grained image recognition (UFGIR) is a challenging task that involves classifying images within a macro-category. While traditional FGIR deals with classifying different species, UFGIR goes beyond by classifying sub-categories within a species, such as cultivars of a plant. Recently, Vision Transformer-based backbones have allowed methods to obtain outstanding recognition performance in this task, but this comes at a significant computational cost, especially since the task benefits significantly from higher-resolution images. Therefore, techniques such as token reduction have emerged to reduce the computational cost. However, dropping tokens leads to loss of essential information for fine-grained categories, especially as the token keep rate is reduced. To counteract the loss of information caused by token reduction, we propose a novel Cross-Layer Aggregation Classification Head and a Cross-Layer Cache mechanism to recover and access information from previous layers in later locations. Extensive experiments covering more than 2000 runs across diverse settings, including 5 datasets, 9 backbones, 7 token reduction methods, 5 keep rates, and 2 image sizes, demonstrate the effectiveness of the proposed plug-and-play modules and allow us to push the boundaries of accuracy vs. cost for UFGIR by reducing the kept tokens to extremely low ratios of up to 10% while maintaining accuracy competitive with state-of-the-art models. Code is available at: https://github.com/arkel23/CLCA
中文: 针对超细粒度图像识别中令牌削减导致的信息丢失问题,我们提出了跨层聚合分类头和跨层缓存机制,通过在多种数据集和骨干网络上验证,能在削减高达90%令牌的同时保持优异识别精度。
English: To address the information loss from token reduction in ultra-fine-grained image recognition, we introduce a Cross-Layer Aggregation Classification Head and Cross-Layer Cache mechanism, enabling high accuracy with up to 90% token reduction across diverse datasets and backbones.
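
To make the notion of a token keep rate concrete, the following sketch shows a generic score-based token reduction step: keep the highest-scoring fraction of tokens and drop the rest. It illustrates only the reduction that the proposed modules are meant to compensate for, not the Cross-Layer Aggregation head or Cross-Layer Cache themselves. The function name and the use of one scalar importance score per token are assumptions for this example.

```python
def keep_top_tokens(tokens, scores, keep_rate):
    """Keep the top `keep_rate` fraction of tokens by score, preserving order.

    tokens: list of token representations (any objects)
    scores: one hypothetical importance score per token
    keep_rate: fraction in (0, 1], e.g. 0.1 keeps roughly 10% of the tokens
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, round(keep_rate * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original token order
    return [tokens[i] for i in kept], kept


tokens = list("abcdefghij")
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.05, 0.4, 0.15, 0.25]
kept_20, idx_20 = keep_top_tokens(tokens, scores, keep_rate=0.2)
kept_10, idx_10 = keep_top_tokens(tokens, scores, keep_rate=0.1)
```

At a keep rate of 0.1 only one of the ten toy tokens survives, which shows how much information later layers lose unless it is recovered from earlier ones.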

Authors:Duo Zhou, Christopher Brix, Grani A Hanasusanto, Huan Zhang
Title: Scalable Neural Network Verification with Branch-and-bound Inferred Cutting Planes
Abstract:
Recently, cutting-plane methods such as GCP-CROWN have been explored to enhance neural network verifiers and made significant advances. However, GCP-CROWN currently relies on generic cutting planes (cuts) generated from external mixed integer programming (MIP) solvers. Due to the poor scalability of MIP solvers, large neural networks cannot benefit from these cutting planes. In this paper, we exploit the structure of the neural network verification problem to generate efficient and scalable cutting planes specific for this problem setting. We propose a novel approach, Branch-and-bound Inferred Cuts with COnstraint Strengthening (BICCOS), which leverages the logical relationships of neurons within verified subproblems in the branch-and-bound search tree, and we introduce cuts that preclude these relationships in other subproblems. We develop a mechanism that assigns influence scores to neurons in each path to allow the strengthening of these cuts. Furthermore, we design a multi-tree search technique to identify more cuts, effectively narrowing the search space and accelerating the BaB algorithm. Our results demonstrate that BICCOS can generate hundreds of useful cuts during the branch-and-bound process and consistently increase the number of verifiable instances compared to other state-of-the-art neural network verifiers on a wide range of benchmarks, including large networks that previous cutting plane methods could not scale to. BICCOS is part of the $α,β$-CROWN verifier, the VNN-COMP 2024 winner. The code is available at http://github.com/Lemutisme/BICCOS .
中文:BICCOS通过利用神经网络验证问题的结构生成专用割平面,显著提升了大规模网络的可验证性,超越了以往方法的局限。
English: BICCOS introduces specialized cutting planes derived from neural network verification's structure, enhancing scalability and verification rates for large networks beyond previous methods' limits.

Authors:James P. Beno
Title: ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis
Abstract:
Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLMs) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.50 macro F1 vs. 79.14 ELECTRA Base FT, 79.41 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.70) at much less cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
中文: 研究表明,将ELECTRA的预测结果与GPT-4o-mini结合能显著提升情感分析性能且成本效益突出,而经过微调的GPT模型中,GPT-4o-mini以76%的成本降幅实现了与GPT-4o近乎相当的性能。
English: This study demonstrates that combining ELECTRA's predictions with GPT-4o-mini significantly enhances sentiment analysis performance cost-effectively, while fine-tuned GPT models achieve top results with GPT-4o-mini offering nearly equivalent performance at substantially lower cost.
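
The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The snippet below assumes that the "76% less cost" claim refers to the ratio of the two reported cost-per-F1-point values. That reading is an assumption, and the variable names are illustrative.

```python
# Reported cost per macro-F1 point (USD), taken from the abstract.
cost_per_f1 = {"GPT-4o FT": 1.59, "GPT-4o-mini FT": 0.38}

# Relative saving of the fine-tuned mini model versus the fine-tuned full model.
saving = 1 - cost_per_f1["GPT-4o-mini FT"] / cost_per_f1["GPT-4o FT"]

# Macro-F1 gains from sharing ELECTRA Base FT predictions with GPT-4o-mini.
gain_vs_electra = 82.50 - 79.14
gain_vs_mini = 82.50 - 79.41

# Gap between the two fine-tuned GPT models.
f1_gap = 86.99 - 86.70

print(f"saving: {saving:.1%}")
print(f"gain over ELECTRA Base FT: {gain_vs_electra:.2f} F1")
print(f"gain over GPT-4o-mini alone: {gain_vs_mini:.2f} F1")
print(f"GPT-4o FT-M vs GPT-4o-mini FT: {f1_gap:.2f} F1")
```

Under that assumption the saving rounds to 76%, consistent with the abstract, while the two fine-tuned GPT models differ by about 0.29 macro-F1 points.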

Authors:Zhengqi Xu, Han Zheng, Jie Song, Li Sun, Mingli Song
Title: Training-free Heterogeneous Model Merging
Abstract:
Model merging has attracted significant attention as a powerful paradigm for model reuse, facilitating the integration of task-specific models into a singular, versatile framework endowed with multifarious capabilities. Previous studies, predominantly utilizing methods such as Weight Average (WA), have shown that model merging can effectively leverage pretrained models without the need for laborious retraining. However, the inherent heterogeneity among models poses a substantial constraint on its applicability, particularly when confronted with discrepancies in model architectures. To overcome this challenge, we propose an innovative model merging framework designed for heterogeneous models, encompassing both depth and width heterogeneity. To address depth heterogeneity, we introduce a layer alignment strategy that harmonizes model layers by segmenting deeper models, treating consecutive layers with similar representations as a cohesive segment, thus enabling the seamless merging of models with differing layer depths. For width heterogeneity, we propose a novel elastic neuron zipping algorithm that projects the weights from models of varying widths onto a common dimensional space, eliminating the need for identical widths. Extensive experiments validate the efficacy of these proposed methods, demonstrating that the merging of structurally heterogeneous models can achieve performance levels comparable to those of homogeneous merging, across both vision and NLP tasks. Our code is publicly available at https://github.com/zju-vipa/training_free_heterogeneous_model_merging.
Chinese: 本研究提出了一种创新的模型融合框架,通过层对齐策略和弹性神经元压缩算法解决深度和宽度异质性问题,实现了结构异构模型的有效整合且性能无损。
English: This study introduces a novel model merging framework that addresses depth and width heterogeneity through layer alignment and an elastic neuron zipping algorithm, enabling effective integration of structurally diverse models without performance loss.
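
For context, the Weight Average (WA) baseline mentioned in the abstract can be sketched as an element-wise mean over parameters with matching names and shapes. The code below is a minimal plain-Python illustration under that assumption, with parameters stored as flat lists. It is not the paper's layer alignment or elastic neuron zipping method, and the names are hypothetical. It also shows why architectural mismatch blocks plain averaging.

```python
def weight_average(state_dicts, coeffs=None):
    """Element-wise (weighted) average of parameter dicts.

    state_dicts: list of {param_name: flat list of floats}
    coeffs: optional per-model coefficients; defaults to a uniform average
    """
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)
    reference = state_dicts[0]
    merged = {}
    for name, ref_values in reference.items():
        for sd in state_dicts:
            # Plain averaging needs identical layers (depth) and sizes (width).
            if name not in sd or len(sd[name]) != len(ref_values):
                raise ValueError(f"cannot average heterogeneous parameter: {name}")
        merged[name] = [
            sum(c * sd[name][k] for c, sd in zip(coeffs, state_dicts))
            for k in range(len(ref_values))
        ]
    return merged


model_a = {"layer1.weight": [1.0, 3.0], "layer1.bias": [0.0]}
model_b = {"layer1.weight": [3.0, 5.0], "layer1.bias": [2.0]}
merged = weight_average([model_a, model_b])

# A wider model cannot be averaged directly with the narrower ones.
wider = {"layer1.weight": [1.0, 2.0, 3.0], "layer1.bias": [0.0]}
try:
    weight_average([model_a, wider])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The failure on `wider` is the kind of width heterogeneity the abstract says the proposed framework is designed to handle.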

Authors:Witold Wydmański, Ulvi Movsum-zada, Jacek Tabor, Marek Śmieja
Title: VisTabNet: Adapting Vision Transformers for Tabular Data
Abstract:
Although deep learning models have had great success in natural language processing and computer vision, we do not observe comparable improvements in the case of tabular data, which is still the most common data type used in biological, industrial and financial applications. In particular, it is challenging to transfer large-scale pre-trained models to downstream tasks defined on small tabular datasets. To address this, we propose VisTabNet -- a cross-modal transfer learning method, which allows for adapting Vision Transformer (ViT) with pre-trained weights to process tabular data. By projecting tabular inputs to patch embeddings acceptable by ViT, we can directly apply a pre-trained Transformer Encoder to tabular inputs. This approach eliminates the conceptual cost of designing a suitable architecture for processing tabular data, while reducing the computational cost of training the model from scratch. Experimental results on multiple small tabular datasets (less than 1k samples) demonstrate VisTabNet's superiority, outperforming both traditional ensemble methods and recent deep learning models. The proposed method goes beyond conventional transfer learning practice and shows that pre-trained image models can be transferred to solve tabular problems, extending the boundaries of transfer learning. We share our example implementation as a GitHub repository available at https://github.com/wwydmanski/VisTabNet.
中文: VisTabNet提出了一种跨模态迁移学习方法,通过将表格数据转换为适合预训练视觉Transformer处理的嵌入表示,在小样本表格数据集上超越了传统方法,拓展了迁移学习的应用边界。
English: VisTabNet introduces a cross-modal transfer learning approach that adapts pre-trained Vision Transformers to process tabular data, outperforming traditional methods on small datasets and demonstrating the potential of repurposing image models for tabular tasks.

Authors:Dibakar Gope, David Mansell, Danny Loh, Ian Bratt
Title: Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs
Abstract:
Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their unprecedented size and resource requirements. While quantizing model weights to sub-byte precision has emerged as a promising solution to ease memory pressure, the group quantization formats commonly used for LLM quantization have significant compute overheads and a resource-intensive dequantization process. As a result, a higher proportion of compute instructions do not perform multiplies (i.e., real work), rendering these formats unsuitable for meeting the latency requirements of LLMs deployed on commodity CPUs. In this work, we propose a set of highly optimized kernels to accelerate LLM inference and unleash the full potential of CPUs, particularly Arm CPUs. These kernels amortize the cost of loading the operands and the cost of weight unpacking across multiple output rows. This, along with the introduction of an optimized interleaved group data layout for weights and decompression path optimizations to reduce unnecessary operations and dequantization overhead while maximizing the use of vector and matrix multiply operations, significantly improves the efficiency of MAC operations. Furthermore, we present a groupwise non-uniform codebook-based quantization method for ultra-low-precision quantization of LLMs to better match non-uniform patterns in their weight distributions, demonstrating better throughput during token generation while ensuring better quality than the state-of-the-art. Applying these improvements to 4-bit LLMs results in a 3-3.2x improvement in prompt processing and a 2x improvement in autoregressive decoding on Arm CPUs, compared to a LLaMA.cpp-based solution. The optimized kernels are available at https://github.com/ggerganov/llama.cpp.
中文: 本研究提出了优化的内核和创新的量化方法,以提升大型语言模型在CPU上的推理效率,在提示处理和自回归解码方面实现了显著的加速效果。
English: This work introduces optimized kernels and a novel quantization method to enhance LLM inference efficiency on CPUs, achieving significant speed improvements in prompt processing and decoding.

Authors:Linqin Wang, Yaping Liu, Zhengtao Yu, Shengxiang Gao, Cunli Mao, Yuxin Huang, Wenjun Wang, Ling Dong
Title: SECodec: Structural Entropy-based Compressive Speech Representation Codec for Speech Language Models
Abstract:
With the rapid advancement of large language models (LLMs), discrete speech representations have become crucial for integrating speech into LLMs. Existing methods for speech representation discretization rely on a predefined codebook size and Euclidean distance-based quantization. However, 1) the size of the codebook is a critical parameter that affects both codec performance and downstream task training efficiency, and 2) Euclidean distance-based quantization may lead to audio distortion when the size of the codebook is controlled within a reasonable range. In fact, in the field of information compression, structural information and entropy guidance are crucial, but previous methods have largely overlooked these factors. Therefore, we address the above issues from an information-theoretic perspective and present SECodec, a novel speech representation codec based on structural entropy (SE) for building speech language models. Specifically, we first model speech as a graph, clustering the speech feature nodes within the graph and extracting the corresponding codebook by hierarchically and disentangledly minimizing 2D SE. Then, to address the issue of audio distortion, we propose a new quantization method. This method still adheres to the 2D SE minimization principle, adaptively selecting the most suitable token corresponding to the cluster for each incoming original speech node. Furthermore, we develop a Structural Entropy-based Speech Language Model (SESLM) that leverages SECodec. Experimental results demonstrate that SECodec performs comparably to EnCodec in speech reconstruction, and SESLM surpasses VALL-E in zero-shot text-to-speech tasks. Code, demo speeches, speech feature graph, SE codebook, and models are available at https://github.com/wlq2019/SECodec.
中文: SECodec提出了一种基于结构熵的语音编解码器,通过优化码本生成和量化方法,在语音重建上媲美EnCodec,并在零样本文本转语音任务中超越VALL-E。
English: SECodec introduces a novel speech codec using structural entropy to optimize codebook generation and quantization, achieving performance comparable to EnCodec in reconstruction and surpassing VALL-E in zero-shot text-to-speech tasks.
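
The Euclidean distance-based quantization that the abstract describes as the existing approach amounts to a nearest-neighbor lookup in a fixed-size codebook. The sketch below illustrates that baseline only, not SECodec's structural-entropy method. The toy 2-D features, the codebook values, and the function name are assumptions for illustration.

```python
def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector.

    frames: list of feature vectors (lists of floats)
    codebook: list of code vectors with the same dimensionality
    """
    tokens = []
    for frame in frames:
        # Squared Euclidean distance to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code))
                 for code in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


# Toy codebook with a predefined size of 3 and a few 2-D "speech" frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [0.6, 0.6]]
tokens = quantize(frames, codebook)
reconstructed = [codebook[t] for t in tokens]
```

The gap between `frames` and `reconstructed` is the quantization error, which grows as the codebook shrinks and corresponds to the distortion the abstract attributes to this baseline.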

Authors:You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
Title: Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Abstract:
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.
中文: 该研究提出了首个多图像精准定位模型Migician,并配套新数据集和基准测试,性能超越现有最佳模型24.94%。
English: The study introduces Migician, the first model for precise multi-image grounding, supported by a new dataset and benchmark, achieving a 24.94% performance improvement over existing models.

Authors:Hui Li, Xiaoyu Ren, Hongjiu Yu, Huiyu Duan, Kai Li, Ying Chen, Libo Wang, Xiongkuo Min, Guangtao Zhai, Xu Liu
Title: Facial Attractiveness Prediction in Live Streaming: A New Benchmark and Multi-modal Method
Abstract:
Facial attractiveness prediction (FAP) has long been an important computer vision task, which could be widely applied in live streaming for facial retouching, content recommendation, etc. However, previous FAP datasets are either small, closed-source, or lack diversity. Moreover, the corresponding FAP models exhibit limited generalization and adaptation ability. To overcome these limitations, in this paper we present LiveBeauty, the first large-scale live-specific FAP dataset, in a more challenging application scenario, i.e., live streaming. 10,000 face images are collected from a live streaming platform directly, with 200,000 corresponding attractiveness annotations obtained from a well-devised subjective experiment, making LiveBeauty the largest open-access FAP dataset in the challenging live scenario. Furthermore, a multi-modal FAP method is proposed to measure the facial attractiveness in live streaming. Specifically, we first extract holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM) and a Multi-modal Attractiveness Encoder Module (MAEM), respectively, then integrate the extracted features through a Cross-Modal Fusion Module (CMFM). Extensive experiments conducted on both LiveBeauty and other open-source FAP datasets demonstrate that our proposed method achieves state-of-the-art performance. Dataset will be available soon.
Chinese: 本文提出了首个面向直播场景的大规模公开面部吸引力预测数据集LiveBeauty,并设计了一种融合面部先验知识与多模态美学特征的新型方法,在实验中取得了最优性能。
English: This paper introduces LiveBeauty, the first large-scale, open-access facial attractiveness prediction dataset for live streaming, and proposes a multi-modal method that achieves state-of-the-art performance by integrating facial prior knowledge and aesthetic features.

Authors:Zitong Xu, Huiyu Duan, Guangji Ma, Liu Yang, Jiarui Wang, Qingbo Wu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Title: HarmonyIQA: Pioneering Benchmark and Model for Image Harmonization Quality Assessment
Abstract:
Image composition involves extracting a foreground object from one image and pasting it into another image through Image harmonization algorithms (IHAs), which aim to adjust the appearance of the foreground object to better match the background. Existing image quality assessment (IQA) methods may fail to align with human visual preference on image harmonization due to the insensitivity to minor color or light inconsistency. To address the issue and facilitate the advancement of IHAs, we introduce the first Image Quality Assessment Database for image Harmony evaluation (HarmonyIQAD), which consists of 1,350 harmonized images generated by 9 different IHAs, and the corresponding human visual preference scores. Based on this database, we propose a Harmony Image Quality Assessment (HarmonyIQA), to predict human visual preference for harmonized images. Extensive experiments show that HarmonyIQA achieves state-of-the-art performance on human visual preference evaluation for harmonized images, and also achieves competing results on traditional IQA tasks. Furthermore, cross-dataset evaluation also shows that HarmonyIQA exhibits better generalization ability than self-supervised learning-based IQA methods. Both HarmonyIQAD and HarmonyIQA will be made publicly available upon paper publication.
中文摘要:本文提出了首个用于图像和谐度评估的图像质量评价数据库HarmonyIQAD,并开发了HarmonyIQA方法,该方法在预测人类对和谐化图像的视觉偏好方面优于现有技术,同时展现出优异的泛化能力。
English Summary: This paper introduces HarmonyIQAD, the first image quality assessment database for evaluating image harmonization, and proposes HarmonyIQA, a novel method that outperforms existing approaches in predicting human visual preferences for harmonized images while demonstrating strong generalization capabilities.

Authors:Yiming Zhang, Zicheng Zhang, Xinyi Wei, Xiaohong Liu, Guangtao Zhai, Xiongkuo Min
Title: IllusionBench+: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models
Abstract:
Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions, especially in real-world scenarios. Existing benchmarks focus on classical cognitive illusions, which have been learned by state-of-the-art (SOTA) VLMs, revealing issues such as hallucinations and limited perceptual abilities. To address this gap, we introduce IllusionBench, a comprehensive visual illusion dataset that encompasses not only classic cognitive illusions but also real-world scene illusions. This dataset features 1,051 images, 5,548 question-answer pairs, and 1,051 golden text descriptions that address the presence, causes, and content of the illusions. We evaluate ten SOTA VLMs on this dataset using true-or-false, multiple-choice, and open-ended tasks. In addition to real-world illusions, we design trap illusions that resemble classical patterns but differ in reality, highlighting hallucination issues in SOTA models. The top-performing model, GPT-4o, achieves 80.59% accuracy on true-or-false tasks and 76.75% on multiple-choice questions, but still lags behind human performance. In the semantic description task, GPT-4o's hallucinations on classical illusions result in low scores for trap illusions, even falling behind some open-source models. IllusionBench is, to the best of our knowledge, the largest and most comprehensive benchmark for visual illusions in VLMs to date.
Chinese: 当前视觉语言模型在常规图像理解上表现卓越,但在处理视觉错觉时存在明显不足;为此我们开发了IllusionBench综合数据集,通过经典与现实场景的错觉测试揭示主流模型的缺陷,其中最佳模型GPT-4o的表现仍逊于人类水平。
English: Current Visual Language Models excel in general image understanding but falter with visual illusions, prompting the creation of IllusionBench—a comprehensive dataset that exposes their limitations in handling both classical and real-world illusions, where even top models like GPT-4o underperform compared to humans.

Authors:Xuzeng Li, Tao Zhang, Jian Wang, Zhen Han, Jiqiang Liu, Jiawen Kang, Dusit Niyato, Abbas Jamalipour
Title: Achieving Network Resilience through Graph Neural Network-enabled Deep Reinforcement Learning
Abstract:
Deep reinforcement learning (DRL) has been widely used in many important tasks of communication networks. To improve DRL's perception of the network, some studies have combined graph neural networks (GNNs) with DRL, using the GNNs to extract unstructured features of the network. However, as networks continue to evolve and become increasingly complex, existing GNN-DRL methods still face challenges in terms of scalability and robustness. Moreover, these methods are inadequate for addressing network security issues. From the perspective of security and robustness, this article explores combining GNNs with DRL to build a resilient network. It starts with a brief tutorial on GNNs and DRL and introduces their existing applications in networks. Furthermore, we introduce the network security methods that can be strengthened by GNN-DRL approaches. We then design a framework based on GNN-DRL to defend against attacks and enhance network resilience. Additionally, we conduct a case study using an encrypted traffic dataset collected from real IoT environments, and the results demonstrate the effectiveness and superiority of our framework. Finally, we highlight key open challenges and opportunities for enhancing network resilience with GNN-DRL.
中文: 本文提出一种图神经网络与深度强化学习融合的框架,通过物联网加密流量案例验证了其在提升网络安全性和韧性方面的有效性,同时解决了可扩展性与鲁棒性难题。
English: This paper proposes a GNN-DRL framework to enhance network security and resilience, demonstrating its effectiveness through a case study on IoT encrypted traffic while addressing scalability and robustness challenges.

Authors:Tianhao Liu, Jiqiang Liu, Tao Zhang, Jian Wang, Jiacheng Wang, Jiawen Kang, Dusit Niyato, Shiwen Mao
Title: Generative AI-driven Cross-layer Covert Communication: Fundamentals, Framework and Case Study
Abstract:
Ensuring end-to-end cross-layer communication security in military networks by selecting covert schemes between nodes is a key solution for military communication security. With the development of communication technology, covert communication has expanded from the physical layer to the network and application layers, utilizing methods such as artificial noise, private networks, and semantic coding to transmit secret messages. However, as adversaries continuously eavesdrop on specific communication channels, the accumulation of sufficient data may reveal underlying patterns that influence concealment, and establishing a cross-layer covert communication mechanism emerges as an effective strategy to mitigate these regulatory challenges. In this article, we first survey communication security solutions based on covert communication, specifically targeting three typical scenarios: device-to-device, private network communication, and public network communication, and analyze their application scopes. Furthermore, we propose an end-to-end cross-layer covert communication scheme driven by Generative Artificial Intelligence (GenAI), highlighting challenges and their solutions. Additionally, a case study is conducted using diffusion reinforcement learning to solve cloud-edge Internet of Things cross-layer secure communication.
中文摘要:本文研究跨层隐蔽通信作为军事网络安全的关键解决方案,提出生成式人工智能驱动的方案并分析其在多种场景下的应用,包括使用扩散强化学习的案例研究。
English Summary: This article explores cross-layer covert communication as a key solution for military network security, proposing a Generative AI-driven scheme and analyzing its application across various scenarios including a case study using diffusion reinforcement learning.

Authors:Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, Zehui Xiong
Title: Adaptive Contextual Caching for Mobile Edge Large Language Model Service
Abstract:
Mobile edge Large Language Model (LLM) deployments face inherent constraints, such as limited computational resources and network bandwidth. Although Retrieval-Augmented Generation (RAG) mitigates some challenges by integrating external knowledge bases, inefficient cache management can still result in high retrieval latency and frequent cache updates. To address these issues, we propose an Adaptive Contextual Caching (ACC) framework that anticipates user needs by proactively caching semantically relevant data for mobile-edge LLMs. ACC utilizes a deep reinforcement learning (DRL) module to refine cache replacement policies, balancing user context, document similarity, and the overhead associated with cache misses. Experimental results demonstrate that ACC increases cache hit rates to over 80% after only 11 training episodes, outperforming FIFO, LRU, and semantic-only caching while reducing retrieval latency by up to 40%. In particular, ACC also reduces local caching overhead (i.e., the cost of updating the cache when a miss occurs) by as much as 55%, enabling scalable, low-latency LLM services in resource-constrained edge environments.
中文摘要:提出的自适应上下文缓存(ACC)框架通过深度强化学习优化移动边缘大语言模型的缓存管理,在资源受限环境中显著提升缓存命中率并降低检索延迟和开销。
English Summary: The proposed Adaptive Contextual Caching (ACC) framework uses deep reinforcement learning to optimize cache management for mobile-edge LLMs, significantly improving cache hit rates and reducing retrieval latency and overhead in resource-constrained environments.
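
The FIFO and LRU baselines named in the abstract, and the cache hit rate they are scored on, can be illustrated with a short replay loop. This is only a sketch of the two classical policies, not of ACC or its DRL module; the function name `hit_rate` and the toy trace are hypothetical.

```python
from collections import OrderedDict

def hit_rate(requests, capacity, policy="lru"):
    """Replay a request trace against a fixed-size cache and return the hit rate."""
    if not requests:
        return 0.0
    cache = OrderedDict()  # front of the dict = next entry to evict
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(key)  # a hit refreshes recency under LRU only
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the entry at the front
            cache[key] = True
    return hits / len(requests)

# Hypothetical trace in which document "a" is requested repeatedly.
trace = ["a", "b", "a", "c", "a", "d", "a", "e"]
lru = hit_rate(trace, capacity=2, policy="lru")
fifo = hit_rate(trace, capacity=2, policy="fifo")
print(lru, fifo)
```

On this trace LRU keeps the frequently requested item and scores 3 hits out of 8, while FIFO evicts it by age and scores 2 out of 8. A learned replacement policy would replace the fixed eviction rule with a decision that also weighs context and similarity.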

Authors:Yan Sun, Tiansheng Huang, Liang Ding, Li Shen, Dacheng Tao
Title: TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs
Abstract:
Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout the training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimension. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the training cost. TeZO can also be easily extended to the Adam variant while consuming less memory than MeZO-SGD, and requiring about only 35% memory of MeZO-Adam. Both comprehensive theoretical analysis and extensive experimental research have validated its efficiency, achieving SOTA-comparable results with lower overhead of time and memory.
中文:TeZO提出了一种新颖的低秩零阶优化方法,同时捕捉模型维度和时间维度的梯度低秩特性,以更低的内存和时间开销实现了与前沿方法相当的性能。
English: TeZO introduces a novel low-rank zeroth-order optimization method that captures gradient low-rankness across both the model and temporal dimensions, achieving results comparable to state-of-the-art methods with lower memory and time overhead.
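
As a rough illustration of the idea in the abstract, a stack of per-step perturbation matrices can be written in canonical polyadic (CP) form: a set of factors shared across all steps plus a small coefficient vector per step. The sketch below assumes that layout and pairs it with a standard two-point zeroth-order estimate; the sizes, the names `U`, `V`, `taus`, and the toy loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, rank, steps = 6, 4, 2, 5  # toy sizes, chosen only for illustration

# Factors shared by every training step (assumed layout of the CP form).
U = rng.standard_normal((m, rank))
V = rng.standard_normal((n, rank))

def perturbation(tau):
    """Build one m x n perturbation as sum_r tau[r] * outer(U[:, r], V[:, r])."""
    return (U * tau) @ V.T

def zo_gradient(loss, W, Z, eps=1e-3):
    """Two-point zeroth-order estimate: a finite difference along Z, times Z."""
    scale = (loss(W + eps * Z) - loss(W - eps * Z)) / (2 * eps)
    return scale * Z

def loss(W):
    """Toy quadratic loss whose true gradient is W itself."""
    return 0.5 * np.sum(W ** 2)

W = rng.standard_normal((m, n))
taus = rng.standard_normal((steps, rank))  # only `rank` fresh numbers per step
Z = perturbation(taus[0])
g = zo_gradient(loss, W, Z)
print(np.linalg.matrix_rank(Z), g.shape)
```

In this layout each step draws `rank` random numbers instead of `m * n`, and every perturbation lies in the same subspace spanned by the shared factors, which mirrors the temporal low-rankness the abstract describes.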

Authors:Anke Tang, Enneng Yang, Li Shen, Yong Luo, Han Hu, Bo Du, Dacheng Tao
Title: Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging
Abstract:
Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approaches. However, these conventional approaches are not well-suited for scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method operates by projecting new parameter updates onto subspaces orthogonal to existing merged parameter updates while using an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5-8% average accuracy improvement while maintaining robust performance in different task orderings.
中文: 本研究提出一种免训练的连续模型融合方法,通过正交投影和自适应缩放机制顺序整合任务专用模型,在保持恒定内存使用的同时实现5-8%的准确率提升,并有效减少任务间干扰。
English: This study introduces a training-free continual model merging method that sequentially integrates task-specific models via orthogonal projections and adaptive scaling, achieving 5-8% higher accuracy with constant memory usage and minimal task interference.
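
The projection step described in the abstract can be sketched as follows: before a new task update is folded in, the component lying in the subspace already occupied by the merged update is removed. The sketch uses the column space of the merged update as that subspace, and its rescaling rule (hold the merged norm at the running mean of task norms) is a hypothetical stand-in for the paper's adaptive scaling, which the abstract does not specify.

```python
import numpy as np

def project_out(new_update, merged_update, tol=1e-10):
    """Remove from new_update the part lying in the column space of merged_update."""
    U, s, _ = np.linalg.svd(merged_update, full_matrices=False)
    Q = U[:, s > tol]  # orthonormal basis of the occupied subspace
    return new_update - Q @ (Q.T @ new_update)

def merge_sequentially(task_updates):
    """Fold task updates in one at a time, keeping only a running merged update."""
    merged = np.zeros_like(task_updates[0])
    mean_norm = 0.0
    for t, delta in enumerate(task_updates, start=1):
        merged = merged + project_out(delta, merged)
        mean_norm += (np.linalg.norm(delta) - mean_norm) / t  # running mean of norms
        # Hypothetical scaling rule: keep the merged norm at the mean task norm.
        merged = merged * (mean_norm / max(np.linalg.norm(merged), 1e-12))
    return merged

# Two toy rank-1 "task vectors" for a 4 x 3 weight matrix.
rng = np.random.default_rng(0)
a = np.outer(rng.standard_normal(4), rng.standard_normal(3))
b = np.outer(rng.standard_normal(4), rng.standard_normal(3))
p = project_out(b, a)
merged = merge_sequentially([a, b])
print(np.abs(a.T @ p).max())
```

Only the running merged update and one scalar are stored, so memory does not grow with the number of models. The toy updates are rank-1 on purpose: a full-rank merged update would leave nothing after this naive projection.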

Authors:Zhen Tian, Wayne Xin Zhao, Ji-Rong Wen
Title: Irrational Complex Rotations Empower Low-bit Optimizers
Abstract:
In this paper, we propose a novel optimizer state compression algorithm, namely $π$-Quant, which leverages the properties of irrational numbers (e.g., $π$) for memory-efficient training. The core idea is based on our mathematical findings, which show that a pair of parameters can be represented by a single rotation angle using the complex rotation scheme. Building on this insight, we map the parameters into a complex space and perform quantization using the corresponding rotation angles. To integrate it efficiently into the optimization process, we develop a system of geometric equations that computes the precise rotation angles with linear complexity. We evaluate $π$-Quant on a wide range of tasks. Our experiments show that it can reduce the bit-width of parameters to 3.32-bit, achieving a 75% reduction in parameter scale and a 40% decrease in GPU memory usage, all while maintaining full accuracy.
中文: 本文提出了一种名为$\pi$-Quant的新型优化器状态压缩算法,利用无理数将参数通过旋转角度映射到复数空间,将参数位宽降至3.32位,在保持精度的同时显著降低了内存使用。
English: This paper introduces $\pi$-Quant, a novel optimizer state compression algorithm that uses irrational numbers to map parameters into complex space via rotation angles, reducing parameter bit-width to 3.32-bit while maintaining full accuracy and significantly cutting memory usage.

Authors:Ningyu Xu, Qi Zhang, Chao Du, Qiang Luo, Xipeng Qiu, Xuanjing Huang, Menghan Zhang
Title: Human-like conceptual representations emerge from language prediction
Abstract:
People acquire concepts through rich physical and social experiences and use them to understand the world. In contrast, large language models (LLMs), trained exclusively through next-token prediction over language data, exhibit remarkably human-like behaviors. Are these models developing concepts akin to humans, and if so, how are such concepts represented and organized? To address these questions, we reframed the classic reverse dictionary task to simulate human concept inference in context and investigated the emergence of human-like conceptual representations within LLMs. Our results demonstrate that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts. The derived representations converged towards a shared, context-independent structure that effectively predicted human behavior across key psychological phenomena, including computation of similarities, categories and semantic scales. Moreover, these representations aligned well with neural activity patterns in the human brain, even in response to visual rather than linguistic stimuli, providing evidence for biological plausibility. These findings establish that structured, human-like conceptual representations can naturally emerge from language prediction without real-world grounding. More broadly, our work positions LLMs as promising computational tools for understanding complex human cognition and paves the way for better alignment between artificial and human intelligence.
中文: 研究表明,大语言模型仅通过语言预测就能形成类似人类的概念结构,能够根据上下文推断概念,并与人类心理模式和神经活动高度吻合。
English: Large language models can develop human-like conceptual structures through language prediction alone, as demonstrated by their ability to infer concepts from context and align with human psychological patterns and neural activity.

Authors:Guangyuan Liu, Hongyang Du, Jiacheng Wang, Dusit Niyato, Dong In Kim
Title: Contract-Inspired Contest Theory for Controllable Image Generation in Mobile Edge Metaverse
Abstract:
The rapid advancement of immersive technologies has propelled the development of the Metaverse, where the convergence of virtual and physical realities necessitates the generation of high-quality, photorealistic images to enhance user experience. However, generating these images, especially through Generative Diffusion Models (GDMs), in mobile edge computing environments presents significant challenges due to the limited computing resources of edge devices and the dynamic nature of wireless networks. This paper proposes a novel framework that integrates contract-inspired contest theory, Deep Reinforcement Learning (DRL), and GDMs to optimize image generation in these resource-constrained environments. The framework addresses the critical challenges of resource allocation and semantic data transmission quality by incentivizing edge devices to efficiently transmit high-quality semantic data, which is essential for creating realistic and immersive images. The use of contest and contract theory ensures that edge devices are motivated to allocate resources effectively, while DRL dynamically adjusts to network conditions, optimizing the overall image generation process. Experimental results demonstrate that the proposed approach not only improves the quality of generated images but also achieves superior convergence speed and stability compared to traditional methods. This makes the framework particularly effective for optimizing complex resource allocation tasks in mobile edge Metaverse applications, offering enhanced performance and efficiency in creating immersive virtual environments.
中文摘要:本文提出了一种融合契约激励竞赛理论、深度强化学习和生成扩散模型的新框架,旨在优化移动边缘计算环境中元宇宙应用的高质量图像生成,有效解决资源分配和语义数据传输的挑战。
English Summary: This paper introduces a novel framework combining contract-inspired contest theory, deep reinforcement learning, and generative diffusion models to optimize photorealistic image generation in resource-constrained mobile edge computing environments for the Metaverse.

Authors:Changyuan Zhao, Guangyuan Liu, Bin Xiang, Dusit Niyato, Benoit Delinchant, Hongyang Du, Dong In Kim
Title: Generative AI Enabled Robust Sensor Placement in Cyber-Physical Power Systems: A Graph Diffusion Approach
Abstract:
With advancements in physical power systems and network technologies, integrated Cyber-Physical Power Systems (CPPS) have significantly enhanced system monitoring and control efficiency and reliability. This integration, however, introduces complex challenges in designing coherent CPPS, particularly as few studies concurrently address the deployment of physical layers and communication connections in the cyber layer. This paper addresses these challenges by proposing a framework for robust sensor placement to optimize anomaly detection in the physical layer and enhance communication resilience in the cyber layer. We model the CPPS as an interdependent network via a graph, allowing for simultaneous consideration of both layers. Then, we adopt the Log-normal Shadowing Path Loss (LNSPL) model to ensure reliable data transmission. Additionally, we leverage the Fiedler value to measure graph resilience against line failures and three anomaly detectors to fortify system safety. However, the optimization problem is NP-hard. Therefore, we introduce the Experience Feedback Graph Diffusion (EFGD) algorithm, which utilizes a diffusion process to generate optimal sensor placement strategies. This algorithm incorporates cross-entropy gradient and experience feedback mechanisms to expedite convergence and generate higher reward strategies. Extensive simulations demonstrate that the EFGD algorithm enhances model convergence by 18.9% over existing graph diffusion methods and improves average reward by 22.90% compared to Denoising Diffusion Policy Optimization (DDPO) and 19.57% compared to Graph Diffusion Policy Optimization (GDPO), thereby significantly bolstering the robustness and reliability of CPPS operations.
中文: 本文提出了一种基于经验反馈图扩散(EFGD)算法的鲁棒传感器部署框架,通过相互依赖的网络建模增强信息物理电力系统的异常检测和通信韧性,显著提升了系统收敛速度和运行可靠性。
English: This paper proposes a robust sensor placement framework using the Experience Feedback Graph Diffusion (EFGD) algorithm to enhance anomaly detection and communication resilience in Cyber-Physical Power Systems, significantly improving convergence and operational reliability through interdependent network modeling.
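
The Fiedler value mentioned in the abstract is the second-smallest eigenvalue of the graph Laplacian L = D - A, and it is zero exactly when the graph is disconnected. A minimal sketch for an unweighted, undirected graph follows; the 4-node ring is a made-up example, not the paper's power network.

```python
import numpy as np

def fiedler_value(n_nodes, edges):
    """Second-smallest eigenvalue of the Laplacian L = D - A of an undirected graph."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.eigvalsh(L)[1]  # eigvalsh returns eigenvalues in ascending order

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
intact = fiedler_value(4, ring)             # 4-node ring
one_failure = fiedler_value(4, ring[:-1])   # one line fails: the ring becomes a path
two_failures = fiedler_value(4, ring[:-2])  # a second failure cuts node 3 off
print(intact, one_failure, two_failures)
```

The value falls from 2 for the ring to 2 - sqrt(2) (about 0.586) for the path and to 0 once a node is isolated, which is why it serves as a measure of resilience against line failures.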

Authors:Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, Hisham Cholakkal, Ivan Laptev, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Title: LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Abstract:
Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large language models (LLMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges with eight different categories ranging from complex visual perception to scientific reasoning with over 4k reasoning steps in total, enabling robust evaluation of LLMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance compared to traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
中文: 本文提出了一个提升大语言模型逐步视觉推理能力的综合框架,包含专用评估基准、新型步骤级评测指标以及LlamaV-o1多模态模型,实验表明该模型在性能和推理速度上均显著优于现有方法。
English: This paper introduces a comprehensive framework to enhance step-by-step visual reasoning in large language models, featuring a specialized benchmark, a novel step-level evaluation metric, and the LlamaV-o1 model, which demonstrates superior performance and efficiency over existing approaches.

Authors:Geng Sun, Minghua Yuan, Zemin Sun, Jiacheng Wang, Hongyang Du, Dusit Niyato, Zhu Han, Dong In Kim
Title: Online Collaborative Resource Allocation and Task Offloading for Multi-access Edge Computing
Abstract:
Multi-access edge computing (MEC) is emerging as a promising paradigm to provide flexible computing services close to user devices (UDs). However, meeting the computation-hungry and delay-sensitive demands of UDs faces several challenges, including the resource constraints of MEC servers, inherent dynamic and complex features in the MEC system, and difficulty in dealing with the time-coupled and decision-coupled optimization. In this work, we first present an edge-cloud collaborative MEC architecture, where the MEC servers and cloud collaboratively provide offloading services for UDs. Moreover, we formulate an energy-efficient and delay-aware optimization problem (EEDAOP) to minimize the energy consumption of UDs under the constraints of task deadlines and long-term queuing delays. Since the problem is proved to be non-convex mixed integer nonlinear programming (MINLP), we propose an online joint communication resource allocation and task offloading approach (OJCTA). Specifically, we transform EEDAOP into a real-time optimization problem by employing the Lyapunov optimization framework. Then, to solve the real-time optimization problem, we propose a communication resource allocation and task offloading optimization method by employing the Tammer decomposition mechanism, convex optimization method, bilateral matching mechanism, and dependent rounding method. Simulation results demonstrate that the proposed OJCTA can achieve superior system performance compared to the benchmark approaches.
中文: 多接入边缘计算(MEC)虽能就近提供计算服务,却面临资源分配与优化难题,本研究通过边云协同架构和在线联合优化方法,有效提升能效并降低延迟。
English: Multi-access edge computing (MEC) offers localized computing services but faces challenges in resource allocation and optimization, which this study addresses through an edge-cloud collaborative architecture and an online joint optimization approach that enhances energy efficiency and reduces delays.

Authors:Wenwen Xie, Geng Sun, Bei Liu, Jiahui Li, Jiacheng Wang, Hongyang Du, Dusit Niyato, Dong In Kim
Title: Joint Optimization of UAV-Carried IRS for Urban Low Altitude mmWave Communications with Deep Reinforcement Learning
Abstract:
Emerging technologies in the sixth generation (6G) of wireless communications, such as terahertz communication and ultra-massive multiple-input multiple-output, present promising prospects. Despite their high data rate potential, millimeter wave (mmWave) communications in urban low altitude economy (LAE) environments are constrained by challenges such as signal attenuation and multipath interference. In particular, in urban environments, mmWave communication experiences significant attenuation due to buildings, owing to its short wavelength, which necessitates developing innovative approaches to improve the robustness of such communications in LAE networking. In this paper, we explore the use of an unmanned aerial vehicle (UAV)-carried intelligent reflecting surface (IRS) to support low altitude mmWave communication. Specifically, we consider a typical urban low altitude communication scenario where a UAV-carried IRS establishes a line-of-sight (LoS) channel between the mobile users and a source user (SU) despite the presence of obstacles. Subsequently, we formulate an optimization problem aimed at maximizing the transmission rates and minimizing the energy consumption of the UAV by jointly optimizing phase shifts of the IRS and UAV trajectory. Given the non-convex nature of the problem and its high dynamics, we propose a deep reinforcement learning-based approach incorporating neural episodic control, long short-term memory, and an IRS phase shift control method to enhance the stability and accelerate the convergence. Simulation results show that the proposed algorithm effectively solves the problem and surpasses other benchmark algorithms on multiple performance measures.
中文摘要:本文提出利用无人机搭载智能反射面增强城市低空毫米波通信,通过深度强化学习联合优化反射面相位与无人机轨迹,仿真结果表明该算法能有效提升通信性能。
English Summary: This paper proposes using a UAV-carried intelligent reflecting surface to enhance millimeter wave communication in urban low-altitude environments by optimizing phase shifts and UAV trajectory through a deep reinforcement learning approach, demonstrating superior performance in simulations.

Authors:Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, Jiecao Chen
Title: ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
Abstract:
Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in https://huggingface.co/datasets/bytedance-research/ToolHop.
中文: ToolHop数据集通过995个查询和3,912个工具解决了多跳工具使用评估基准缺失的问题,揭示了大型语言模型在此任务中的显著挑战——最佳模型GPT-4o仅达49.04%准确率,并为未来发展提供了改进方向。
English: The ToolHop dataset addresses the lack of reliable evaluation benchmarks for multi-hop tool use by providing 995 queries and 3,912 tools, revealing significant challenges for LLMs with top-performing GPT-4o achieving only 49.04% accuracy and offering insights for future improvements.

Authors:Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Ji-Rong Wen
Title: Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking
Abstract:
Large language models (LLMs) demonstrate exceptional capabilities, yet still face the hallucination issue. Typical text generation approaches adopt an auto-regressive generation without deliberate reasoning, which often results in untrustworthy and factually inaccurate responses. In this paper, we propose HaluSearch, a novel framework that incorporates tree search-based algorithms (e.g. MCTS) to enable an explicit slow thinking generation process for mitigating hallucinations of LLMs during inference. Specifically, HaluSearch frames text generation as a step-by-step reasoning process, using a self-evaluation reward model to score each generation step and guide the tree search towards the most reliable generation pathway for fully exploiting the internal knowledge of LLMs. To balance efficiency and quality, we introduce a hierarchical thinking system switch mechanism inspired by the dual process theory in cognitive science, which dynamically alternates between fast and slow thinking modes at both the instance and step levels, adapting to the complexity of questions and reasoning states. We conduct extensive experiments on both English and Chinese datasets and the results show that our approach significantly outperforms baseline approaches.
中文摘要:HaluSearch提出了一种基于树搜索的新框架,通过引入显式推理过程和分层思维机制,有效引导文本生成走向可靠路径,从而显著减少大语言模型的幻觉问题。
English Summary: HaluSearch introduces a tree search-based framework that integrates explicit reasoning and a hierarchical thinking mechanism to significantly reduce hallucinations in large language models by guiding generation toward reliable pathways.

Authors:Ruichen Zhang, Changyuan Zhao, Hongyang Du, Dusit Niyato, Jiacheng Wang, Suttinee Sawadsitang, Xuemin Shen, Dong In Kim
Title: Embodied AI-Enhanced Vehicular Networks: An Integrated Large Language Models and Reinforcement Learning Method
Abstract:
This paper investigates adaptive transmission strategies in embodied AI-enhanced vehicular networks by integrating large language models (LLMs) for semantic information extraction and deep reinforcement learning (DRL) for decision-making. The proposed framework aims to optimize both data transmission efficiency and decision accuracy by formulating an optimization problem that incorporates the Weber-Fechner law, serving as a metric for balancing bandwidth utilization and quality of experience (QoE). Specifically, we employ the large language and vision assistant (LLAVA) model to extract critical semantic information from raw image data captured by embodied AI agents (i.e., vehicles), reducing transmission data size by more than 90\% while retaining essential content for vehicular communication and decision-making. In the dynamic vehicular environment, we employ a generalized advantage estimation-based proximal policy optimization (GAE-PPO) method to stabilize decision-making under uncertainty. Simulation results show that attention maps from LLAVA highlight the model's focus on relevant image regions, enhancing semantic representation accuracy. Additionally, our proposed transmission strategy improves QoE by up to 36\% compared to DDPG and accelerates convergence by reducing required steps by up to 47\% compared to pure PPO. Further analysis indicates that adapting semantic symbol length provides an effective trade-off between transmission quality and bandwidth, achieving up to a 61.4\% improvement in QoE when scaling from 4 to 8 vehicles.
中文: 本文提出了一种结合大语言模型进行语义数据压缩和深度强化学习优化决策的自适应车载网络传输框架,显著提升了传输效率和体验质量。
English: This paper introduces an adaptive transmission framework for vehicular networks that combines large language models for semantic data compression and deep reinforcement learning for optimized decision-making, significantly enhancing transmission efficiency and quality of experience.
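
The abstract above names generalized advantage estimation (GAE) as the stabilizing ingredient of its PPO variant. The following is a minimal sketch of the standard GAE recursion only, not the authors' implementation; the function name, the default values of `gamma` and `lam`, and the convention that `values` carries one extra bootstrap entry are assumptions made for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE for a single trajectory (illustrative sketch).

    rewards: list of T per-step rewards.
    values:  list of T + 1 value estimates; the last entry is the
             bootstrap value of the state after the final step
             (an assumed convention, not taken from the paper).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` this reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline, which is the bias-variance trade-off GAE is meant to expose.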

Authors:Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang
Title: MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision
Abstract:
Medical language models face critical barriers to real-world clinical reasoning applications. However, mainstream efforts, which fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, are still far from a versatile, credible and efficient language model for clinical reasoning usage. To this end, we propose MedS$^3$, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we introduce a soft dual process reward model that incorporates value dynamics: steps that degrade node value are penalized, enabling fine-grained identification of reasoning errors even when the final answer is correct. Experiments on eleven benchmarks show that MedS$^3$ outperforms the previous state-of-the-art medical model by +6.45 accuracy points and surpasses 32B-scale general-purpose reasoning models by +8.57 points. Additional empirical analysis further demonstrates that MedS$^3$ achieves robust and faithful reasoning behavior.
Chinese: 提出的自我进化框架 MedS$^3$ 通过蒙特卡洛树搜索和强化学习生成可验证的推理轨迹,显著提升了小型医疗语言模型的临床推理能力,在多项基准测试中取得了最优性能。
English: The proposed self-evolving framework, MedS$^3$, enhances small medical language models by generating verified reasoning trajectories through Monte Carlo Tree Search and reinforcement learning, achieving state-of-the-art performance on clinical reasoning benchmarks.

Authors:Benyuan Liu, Xu Chen, Yanfeng Wang, Ya Zhang, Zhi Cao, Ivor Tsang
Title: Active Sampling for Node Attribute Completion on Graphs
Abstract:
Node attributes, a crucial type of information for graph analysis, may be partially or completely missing for certain nodes in real-world applications. Restoring the missing attributes is expected to benefit downstream graph learning. Few attempts have been made at node attribute completion, although a framework called Structure-attribute Transformer (SAT) was recently proposed that uses a decoupled scheme to leverage structures and attributes. However, SAT ignores differences in how nodes contribute to the learning schedule, and finding a practical way to model the differing importance of nodes with observed attributes is challenging. This paper proposes a novel AcTive Sampling algorithm (ATS) to restore missing node attributes. The representativeness and uncertainty of each node's information are first measured based on graph structure, representation similarity and learning bias. To select nodes as training samples in the next optimization step, a weighting scheme controlled by a Beta distribution is then introduced to linearly combine the two properties. Extensive experiments on four public benchmark datasets and two downstream tasks have shown the superiority of ATS in node attribute completion.
中文: 本文提出新颖的主动采样算法ATS,通过基于图结构和表示相似性衡量节点的代表性与不确定性,并采用Beta分布控制的加权方案选择训练样本,实验证明其在节点属性补全方面具有优越性能。
English: This paper introduces the AcTive Sampling algorithm (ATS), which effectively restores missing node attributes by measuring node representativeness and uncertainty through graph structure and representation similarity, then using a Beta distribution-based weighting scheme to select optimal training samples, demonstrating superior performance in experiments.
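
The selection step described above (a Beta-distributed weight that linearly combines representativeness and uncertainty) can be sketched as follows. This is an illustration under stated assumptions, not the ATS implementation: the function name `select_nodes`, the shape parameters `alpha` and `beta`, the top-k selection rule, and the choice of which score the sampled weight multiplies are all hypothetical.

```python
import random


def select_nodes(representativeness, uncertainty, k, alpha=1.0, beta=1.0, rng=None):
    """Pick k nodes by a Beta-weighted mix of two per-node scores (sketch).

    representativeness, uncertainty: dicts mapping node id -> score.
    alpha, beta: Beta shape parameters (hypothetical defaults).
    Returns the selected node ids and the sampled mixing weight.
    """
    rng = rng or random.Random()
    w = rng.betavariate(alpha, beta)  # mixing weight in [0, 1]
    scores = {
        node: w * representativeness[node] + (1.0 - w) * uncertainty[node]
        for node in representativeness
    }
    # Highest combined score first; keep the top k as the next training samples.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], w
```

Changing `alpha` and `beta` over the course of training would shift the emphasis between the two scores; whether and how ATS schedules these parameters is not stated in the abstract.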

Authors:Soonhyo Kim, Naoaki Kanazawa, Shun Hasegawa, Kento Kawaharazuka, Kei Okada
Title: Front Hair Styling Robot System Using Path Planning for Root-Centric Strand Adjustment
Abstract:
Hair styling is a crucial aspect of personal grooming, significantly influenced by the appearance of front hair. While brushing is commonly used both to detangle hair and for styling purposes, existing research primarily focuses on robotic systems for detangling hair, with limited exploration into robotic hair styling. This research presents a novel robotic system designed to automatically adjust front hairstyles, with an emphasis on path planning for root-centric strand adjustment. The system utilizes images to compare the current hair state with the desired target state through an orientation map of hair strands. By concentrating on the differences in hair orientation and specifically targeting adjustments at the root of each strand, the system performs detailed styling tasks. The path planning approach ensures effective alignment of the hairstyle with the target, and a closed-loop mechanism refines these adjustments to accurately evolve the hairstyle towards the desired outcome. Experimental results demonstrate that the proposed system achieves a high degree of similarity and consistency in front hair styling, showing promising results for automated, precise hairstyle adjustments.
中文: 本研究提出一种机器人系统,通过图像分析对比当前与目标发丝方向,采用闭环路径规划在发根层面进行发束调整,实现了高精度、高一致性的前发自动造型。
English: This study introduces a robotic system that automates front hairstyling by comparing current and target hair orientations through image analysis and performing root-level strand adjustments via closed-loop path planning, achieving high styling precision and consistency.

Authors:Yucheng Ding, Yangwenjian Tan, Xiangyu Liu, Chaoyue Niu, Fandong Meng, Jie Zhou, Ning Liu, Fan Wu, Guihai Chen
Title: Personalized Language Model Learning on Text Data Without User Identifiers
Abstract:
In many practical natural language applications, user data are highly sensitive, requiring anonymous uploads of text data from mobile devices to the cloud without user identifiers. However, the absence of user identifiers restricts the ability of cloud-based language models to provide personalized services, which are essential for catering to diverse user needs. The trivial method of replacing an explicit user identifier with a static user embedding as model input still compromises data anonymization. In this work, we propose to let each mobile device maintain a user-specific distribution to dynamically generate user embeddings, thereby breaking the one-to-one mapping between an embedding and a specific user. We further theoretically demonstrate that to prevent the cloud from tracking users via uploaded embeddings, the local distributions of different users should either be derived from a linearly dependent space to avoid identifiability or be close to each other to prevent accurate attribution. Evaluation on both public and industrial datasets using different language models reveals a remarkable improvement in accuracy from incorporating anonymous user embeddings, while still meeting real-time inference requirements.
Chinese: 本研究提出了一种动态用户嵌入方法,通过从设备特定分布生成嵌入来保持用户匿名性,在防止直接用户追踪的同时,提升了云端语言模型个性化服务的准确性。
English: This study introduces a dynamic user embedding method that maintains user anonymity by generating embeddings from device-specific distributions, preventing direct user tracking while enhancing personalized service accuracy in cloud-based language models.

Authors:Zecheng Li, Wengang Zhou, Weichao Zhao, Kepeng Wu, Hezhen Hu, Houqiang Li
Title: Uni-Sign: Toward Unified Sign Language Understanding at Scale
Abstract:
Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose Uni-Sign, a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, Uni-Sign unifies SLU tasks by treating downstream tasks as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, addressing keypoint inaccuracies and improving computational efficiency. Extensive experiments across multiple SLU benchmarks demonstrate that Uni-Sign achieves state-of-the-art performance across multiple downstream SLU tasks. Dataset and code are available at github.com/ZechengLi19/Uni-Sign.
Chinese Summary: Uni-Sign 是一个统一的预训练框架,通过大规模生成式预训练和创新的微调范式消除了手语理解任务中预训练与下游任务之间的差距,在多个基准测试中实现了最先进的性能。
English Summary: Uni-Sign is a unified pre-training framework that bridges the gap between pre-training and fine-tuning for sign language understanding tasks by employing large-scale generative pre-training and a novel fine-tuning approach, achieving state-of-the-art performance across multiple benchmarks.

Authors:Longtao Jiang, Zhendong Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Lei Shi, Dong Chen, Houqiang Li
Title: SmartEraser: Remove Anything from Images using Masked-Region Guidance
Abstract:
Object removal has so far been dominated by the mask-and-inpaint paradigm, where the masked region is excluded from the input, leaving models relying on unmasked areas to inpaint the missing region. However, this approach lacks contextual information for the masked area, often resulting in unstable performance. In this work, we introduce SmartEraser, built with a new removing paradigm called Masked-Region Guidance. This paradigm retains the masked region in the input, using it as guidance for the removal process. It offers several distinct advantages: (a) it guides the model to accurately identify the object to be removed, preventing its regeneration in the output; (b) since the user mask often extends beyond the object itself, it aids in preserving the surrounding context in the final result. Leveraging this new paradigm, we present Syn4Removal, a large-scale object removal dataset, where instance segmentation data is used to copy and paste objects onto images as removal targets, with the original images serving as ground truths. Experimental results demonstrate that SmartEraser significantly outperforms existing methods, achieving superior performance in object removal, especially in complex scenes with intricate compositions.
中文摘要:SmartEraser提出了一种新颖的掩码区域引导范式,将掩码区域保留为输入引导,既能精确去除目标物体又能保持上下文,并通过新数据集和实验验证了其卓越性能。
English Summary: SmartEraser introduces a novel Masked-Region Guidance paradigm that retains masked regions as input guidance, enabling precise object removal while preserving context, and demonstrates superior performance through a new dataset and experiments.

Authors:Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, Juanzi Li
Title: PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament
Abstract:
Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Judge Reward Model (PairJudge RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given one math problem, PairJudge RM judges two candidate solutions' correctness with chain-of-thought reasoning simultaneously. This approach eliminates the need for scoring and enables cross-validation of solutions through parallel judgment. In the knockout tournament, PairJudge RM conducts pairwise judgments between candidate solutions and eliminates the incorrect ones iteratively. We construct PairJudge-432K, a large-scale dataset of 432K pairwise judgments derived from NuminaMath and annotated using \texttt{gemini-1.5-flash}, and train the PairJudge RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over baseline reward models. In particular, a 40\% to 60\% relative improvement is achieved on the most challenging 50\% of problems.
Chinese: 针对传统奖励模型在最佳N采样中的不足,成对评判奖励模型通过并行判断解决方案的正确性并结合淘汰赛机制,显著提升了在复杂数学问题上的表现。
English: To overcome the limitations of traditional reward models in Best-of-N sampling, the Pairwise Judge Reward Model (PairJudge RM) evaluates candidate solutions through pairwise correctness judgments and a knockout tournament, achieving substantial performance gains on challenging math problems.
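
The knockout procedure described above can be sketched in a few lines. This is an illustration only: the `judge` callable stands in for the pairwise judge model, and the handling of byes and of pairs where both or neither candidate is judged correct (a random pick here) is an assumption, since the abstract does not specify it.

```python
import random


def knockout_best_of_n(candidates, judge, rng=None):
    """Select one candidate through rounds of pairwise judgments (sketch).

    judge(a, b) -> (a_correct, b_correct): a hypothetical callable that
    plays the role of the pairwise judge reward model.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = rng or random.Random(0)
    pool = list(candidates)
    while len(pool) > 1:
        rng.shuffle(pool)
        next_round = []
        if len(pool) % 2 == 1:
            next_round.append(pool.pop())  # odd one out gets a bye (assumption)
        for a, b in zip(pool[0::2], pool[1::2]):
            a_ok, b_ok = judge(a, b)
            if a_ok and not b_ok:
                next_round.append(a)
            elif b_ok and not a_ok:
                next_round.append(b)
            else:
                # Both or neither judged correct: break the tie at random (assumption).
                next_round.append(rng.choice([a, b]))
        pool = next_round
    return pool[0]
```

For N candidates the loop issues N - 1 judge calls in total, because each call eliminates exactly one candidate.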

Authors:Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, Bowen Zhou
Title: Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
Abstract:
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. Dolphin first generates novel ideas based on feedback from previous experiments and relevant papers ranked by the topic and task attributes. Then, the generated ideas can be implemented using a code template refined and debugged with the designed exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and a subset of MLE-bench. Results show that Dolphin can continuously improve the performance of the input topic in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 3D point classification.
Chinese: Dolphin框架提出了一种闭环、大语言模型驱动的系统,通过生成想法、实施代码和分析结果来自动化科学研究,在3D点分类等任务中展现出与最先进方法相媲美的性能提升。
English: The Dolphin framework introduces a closed-loop, LLM-driven system that automates scientific research by generating ideas, implementing code, and analyzing results, demonstrating performance improvements comparable to state-of-the-art methods in tasks like 3D point classification.

Authors:Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Shihao Liu, Shuaiqing Wang, Dawei Yin, Xueqi Cheng
Title: Generative Retrieval for Book search
Abstract:
In book search, relevant book information should be returned in response to a query. Books contain complex, multi-faceted information such as metadata, outlines, and main text, where the outline provides hierarchical information between chapters and sections. Generative retrieval (GR) is a new retrieval paradigm that consolidates corpus information into a single model to generate identifiers of documents that are relevant to a given query. How can GR be applied to book search? Directly applying GR to book search is a challenge due to the unique characteristics of book search: The model needs to retain the complex, multi-faceted information of the book, which increases the demand for labeled data. Splitting book information and treating it as a collection of separate segments for learning might result in a loss of hierarchical information. We propose an effective Generative retrieval framework for Book Search (GBS) that features two main components: data augmentation and outline-oriented book encoding. For data augmentation, GBS constructs multiple query-book pairs for training; it constructs multiple book identifiers based on the outline, various forms of book contents, and simulates real book retrieval scenarios with varied pseudo-queries. This includes coverage-promoting book identifier augmentation, allowing the model to learn to index effectively, and diversity-enhanced query augmentation, allowing the model to learn to retrieve effectively. Outline-oriented book encoding improves length extrapolation through bi-level positional encoding and retentive attention mechanisms to maintain context over long sequences. Experiments on a proprietary Baidu dataset demonstrate that GBS outperforms strong baselines, achieving a 9.8\% improvement in terms of MRR@20, over the state-of-the-art RIPOR method...
中文: 生成式检索在图书搜索中面临保留层次信息和数据增强的挑战,因此提出的GBS框架采用大纲导向编码和查询增强来提升检索效果。
English: Generative retrieval for book search faces challenges in preserving hierarchical information and requires data augmentation, so the proposed GBS framework uses outline-oriented encoding and query augmentation to enhance performance.
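
The headline result above is reported in MRR@20. As a reminder of what that metric measures, here is a minimal sketch of mean reciprocal rank with a cutoff; the function name and the dictionary-based input layout are assumptions for illustration and are unrelated to the authors' evaluation code.

```python
def mrr_at_k(rankings, relevant, k=20):
    """Mean reciprocal rank with cutoff k (illustrative sketch).

    rankings: dict mapping query id -> ranked list of book ids.
    relevant: dict mapping query id -> set of relevant book ids.
    A query with no relevant book in its top k contributes 0.
    """
    if not rankings:
        raise ValueError("need at least one query")
    total = 0.0
    for query, ranking in rankings.items():
        for rank, book in enumerate(ranking[:k], start=1):
            if book in relevant[query]:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(rankings)
```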

Authors:Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, Yuxiao Dong
Title: T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1 to scale RL by encouraging exploration and to understand inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through oversampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1's better performance without any additional verification.
中文:T1方法通过结合合成的思维链数据与通过过采样促进探索的强化学习,提升了大型语言模型的复杂推理能力,在数学基准测试中表现出优越性能,并实现了无需额外验证的有效推理扩展。
English: The T1 method enhances large language models' complex reasoning by combining synthesized chain-of-thought data with reinforcement learning that promotes exploration through oversampling, demonstrating superior performance on math benchmarks and effective inference scaling without additional verification.

Authors:Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
Title: MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Abstract:
In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM's ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .
中文: MotionBench作为一个评估基准,旨在检验视觉语言模型在细粒度运动理解上的能力,揭示了现有模型的不足,并提出了一种高效的Through-Encoder融合方法,通过提高帧率输入来增强运动感知。
English: MotionBench is introduced as a benchmark to evaluate fine-grained motion comprehension in vision language models, revealing their current limitations and proposing a Through-Encoder Fusion method that improves performance with higher frame rates.

Authors:Jiawei Liu, Yuanzhi Zhu, Feiyu Gao, Zhibo Yang, Peng Wang, Junyang Lin, Xinggang Wang, Wenyu Liu
Title: SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild
Abstract:
Generating visual text in natural scene images is a challenging task with many unsolved problems. Different from generating text on artificially designed images (such as posters, covers, cartoons, etc.), the text in natural scene images needs to meet the following four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, walls, etc.), and the generated text content should also be relevant to the scene. (3) Utility: the generated text should facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable as needed. In this paper, we propose a two-stage method, SceneVTG++, which simultaneously satisfies the four aspects mentioned above. SceneVTG++ consists of a Text Layout and Content Generator (TLCG) and a Controllable Local Text Diffusion (CLTD). The former utilizes the world knowledge of multimodal large language models to find reasonable text areas and recommend text content according to the natural scene background images, while the latter generates controllable multilingual text based on the diffusion model. Through extensive experiments, we verify the effectiveness of TLCG and CLTD individually, and demonstrate the state-of-the-art text generation performance of SceneVTG++. In addition, the generated images have superior utility in OCR tasks like text detection and text recognition. Codes and datasets will be available.
Chinese: 本文提出SceneVTG++方法,通过两阶段设计在自然场景图像中生成逼真、合理且可控的文本,实验证明其满足保真度、合理性、实用性和可控性要求,并展现出领先的文本生成性能。
English: This paper introduces SceneVTG++, a two-stage method that generates realistic and contextually appropriate text in natural scene images, meeting fidelity, reasonability, utility, and controllability criteria, and demonstrates state-of-the-art performance in experiments.

Authors:Peilong Wang, Zhengliang Liu, Yiwei Li, Jason Holmes, Peng Shu, Lian Zhang, Xiang Li, Quanzheng Li, Brady S. Laughlin, Diego Santos Toesca, Sujay A. Vora, Samir H. Patel, Terence T. Sio, Tianming Liu, Wei Liu
Title: Fine-Tuning Open-Source Large Language Models to Improve Their Performance on Radiation Oncology Tasks: A Feasibility Study to Investigate Their Potential Clinical Applications in Radiation Oncology
Abstract:
Background: The radiation oncology clinical practice involves many steps relying on the dynamic interplay of abundant text data. Large language models have displayed remarkable capabilities in processing complex text information. However, their direct applications in specific fields like radiation oncology remain underexplored. Purpose: This study aims to investigate whether fine-tuning LLMs with domain knowledge can improve the performance on Task (1) treatment regimen generation, Task (2) treatment modality selection (photon, proton, electron, or brachytherapy), and Task (3) ICD-10 code prediction in radiation oncology. Methods: Data for 15,724 patient cases were extracted. Cases in which patients had a single diagnostic record and a clearly identifiable primary treatment plan were selected for preprocessing and manual annotation, yielding 7,903 cases with patient diagnosis, treatment plan, treatment modality, and ICD-10 code. Each case was used to construct a pair consisting of patient diagnostics details and an answer (treatment regimen, treatment modality, or ICD-10 code respectively) for the supervised fine-tuning of these three tasks. The open-source LLaMA2-7B and Mistral-7B models were fine-tuned with the Low-Rank Approximations method. Accuracy and ROUGE-1 score were reported for the fine-tuned models and original models. Clinical evaluation was performed on Task (1) by radiation oncologists, while precision, recall, and F1 score were evaluated for Task (2) and (3). One-sided Wilcoxon signed-rank tests were used to statistically analyze the results. Results: Fine-tuned LLMs outperformed original LLMs across all tasks with p-value <= 0.001. Clinical evaluation demonstrated that over 60% of the fine-tuned LLMs-generated treatment regimens were clinically acceptable. Precision, recall, and F1 score showed improved performance of fine-tuned LLMs.
中文: 本研究证明,通过领域知识微调的大语言模型在放射肿瘤学的治疗方案生成、治疗方式选择和ICD-10编码预测任务中表现显著提升,临床评估和统计分析均验证了其有效性。
English: This study demonstrates that fine-tuning large language models with domain-specific knowledge significantly enhances their performance in radiation oncology tasks, including treatment regimen generation, modality selection, and ICD-10 code prediction, as validated through clinical evaluation and statistical analysis.

Authors:Wei Ruan, Yanjun Lyu, Jing Zhang, Jiazhang Cai, Peng Shu, Yang Ge, Yao Lu, Shang Gao, Yue Wang, Peilong Wang, Lin Zhao, Tao Wang, Yufang Liu, Luyang Fang, Ziyu Liu, Zhengliang Liu, Yiwei Li, Zihao Wu, Junhao Chen, Hanqi Jiang, Yi Pan, Zhenyuan Yang, Jingyuan Chen, Shizhe Liang, Wei Zhang, Terry Ma, Yuan Dou, Jianli Zhang, Xinyu Gong, Qi Gan, Yusong Zou, Zebang Chen, Yuanxin Qian, Shuo Yu, Jin Lu, Kenan Song, Xianqiao Wang, Andrea Sikora, Gang Li, Xiang Li, Quanzheng Li, Yingfeng Wang, Lu Zhang, Yohannes Abate, Lifang He, Wenxuan Zhong, Rongjie Liu, Chao Huang, Wei Liu, Ye Shen, Ping Ma, Hongtu Zhu, Yajun Yan, Dajiang Zhu, Tianming Liu
Title: Large Language Models for Bioinformatics
Abstract:
With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
中文摘要:本综述全面分析了生物信息学专用语言模型(BioLMs)的发展历程、在疾病诊断与药物研发等领域的应用,并针对数据隐私、模型偏差等挑战提出未来研究方向。
English Summary: This survey comprehensively reviews bioinformatics-specific language models (BioLMs), detailing their evolution, applications in disease diagnosis and drug discovery, while addressing challenges like data privacy and biases to guide future advancements.

Authors:Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
Title: RbFT: Robust Fine-tuning for Retrieval-Augmented Generation against Retrieval Defects
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved from a knowledge base. However, its effectiveness is fundamentally constrained by the reliability of both the retriever and the knowledge base. In real-world scenarios, imperfections in these components often lead to the retrieval of noisy, irrelevant, or misleading counterfactual information, ultimately undermining the trustworthiness of RAG systems. To address this challenge, we propose Robust Fine-Tuning (RbFT), a method designed to enhance the resilience of LLMs against retrieval defects through two targeted fine-tuning tasks. Experimental results demonstrate that RbFT significantly improves the robustness of RAG systems across diverse retrieval conditions, surpassing existing methods while maintaining high inference efficiency and compatibility with other robustness techniques.
中文: RAG系统因检索不准确而存在可靠性问题,但提出的RbFT方法通过针对性微调增强了大语言模型的抗干扰能力,在不同检索条件下显著提升了系统鲁棒性,同时保持了高效性。
English: RAG systems face reliability issues due to retrieval inaccuracies, but the proposed RbFT method enhances LLM resilience through specialized fine-tuning, significantly improving robustness across varied conditions while maintaining efficiency.

Authors:Zhenhao Zhu, Bulou Liu, Qingyao Ai, Yiqun Liu
Title: Option-ID Based Elimination For Multiple Choice Questions
Abstract:
Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs). Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method. Existing PoE methods typically either have LLMs directly identify incorrect options or score options and replace lower-scoring ones with [MASK]. However, both methods suffer from inapplicability or suboptimal performance. To address these issues, this paper proposes a novel option-ID based PoE ($\text{PoE}_{\text{ID}}$). $\text{PoE}_{\text{ID}}$ critically incorporates a debiasing technique to counteract LLMs' token bias, enhancing robustness over naive ID-based elimination. It features two strategies: $\text{PoE}_{\text{ID}}^{\text{log}}$, which eliminates options whose IDs have log probabilities below the average threshold, and $\text{PoE}_{\text{ID}}^{\text{seq}}$, which iteratively removes the option with the lowest ID probability. We conduct extensive experiments with 6 different LLMs on 4 diverse datasets. The results demonstrate that $\text{PoE}_{\text{ID}}$, especially $\text{PoE}_{\text{ID}}^{\text{log}}$, significantly improves zero-shot and few-shot MCQ performance, particularly in datasets with more options. Our analyses demonstrate that $\text{PoE}_{\text{ID}}^{\text{log}}$ enhances the LLMs' confidence in selecting the correct option, and the option elimination strategy outperforms methods relying on [MASK] replacement. We further investigate the limitations of LLMs in directly identifying incorrect options, which stem from their inherent deficiencies.
中文: 本文提出了一种基于选项ID的新型排除法(PoE_ID),通过去偏技术增强大语言模型在选择题中的表现,实验证明该方法尤其在多选项数据集中显著优于现有方法。
English: This paper introduces a novel option-ID based process of elimination (PoE_ID) with debiasing to enhance large language models' performance on multiple-choice questions, demonstrating significant improvements over existing methods through extensive experiments.
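
The log-probability elimination rule described in the abstract (drop options whose ID log probability falls below the average) can be illustrated with a minimal sketch. The function name and the example scores are hypothetical, and the debiasing step mentioned in the abstract is not modeled here.

```python
from typing import Dict, List


def eliminate_below_average(id_logprobs: Dict[str, float]) -> List[str]:
    """Keep the option IDs whose log probability is at least the average.

    `id_logprobs` maps an option ID (e.g. "A") to the log probability the
    model assigned to that ID. The values used below are made up.
    """
    if not id_logprobs:
        return []
    threshold = sum(id_logprobs.values()) / len(id_logprobs)
    return [opt for opt, lp in id_logprobs.items() if lp >= threshold]


# Hypothetical scores for a four-option question.
scores = {"A": -0.4, "B": -2.5, "C": -1.0, "D": -3.1}
kept = eliminate_below_average(scores)
print(kept)
```

In this sketch the surviving options would then be presented to the model again for the final answer; how the paper re-prompts is not specified in the abstract.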

Authors:Rong Shan, Jiachen Zhu, Jianghao Lin, Chenxu Zhu, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Title: Full-Stack Optimized Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation
Abstract:
In this paper, we address the lifelong sequential behavior incomprehension problem in large language models (LLMs) for recommendation, where LLMs struggle to extract useful information from long user behavior sequences, even within their context limits. To tackle this, we propose ReLLaX (Retrieval-enhanced Large Language models Plus), a framework offering optimization across data, prompt, and parameter levels. At the data level, we introduce Semantic User Behavior Retrieval (SUBR) to reduce sequence heterogeneity, making it easier for LLMs to extract key information. For prompt-level enhancement, we employ Soft Prompt Augmentation (SPA) to inject collaborative knowledge, aligning item representations with recommendation tasks and improving LLMs' exploration of item relationships. Finally, at the parameter level, we propose Component Fully-interactive LoRA (CFLoRA), which enhances LoRA's expressiveness by enabling interactions between its components, allowing better capture of sequential information. Moreover, we present new perspectives to compare current LoRA-based LLM4Rec methods, i.e., from both a composite and a decomposed view. We theoretically demonstrate that the ways they employ LoRA for recommendation are degraded versions of our CFLoRA, with different constraints on atom component interactions. Extensive experiments on three public datasets demonstrate ReLLaX's superiority over existing baselines and its ability to mitigate lifelong sequential behavior incomprehension effectively.
中文摘要:本文提出ReLLaX框架,通过语义检索减少序列异质性、软提示增强注入协同知识、以及改进的LoRA组件交互,有效解决大语言模型在推荐系统中难以理解长用户行为序列的问题。
English Summary: This paper introduces ReLLaX, a framework that addresses LLMs' difficulty in understanding long user behavior sequences for recommendation by optimizing data, prompts, and parameters through semantic retrieval, soft prompt augmentation, and enhanced LoRA interactions.

Authors:Hao Lang, Fei Huang, Yongbin Li
Title: Debate Helps Weak-to-Strong Generalization
Abstract:
Common methods for aligning already-capable models with desired behavior rely on the ability of humans to provide supervision. However, future superhuman models will surpass the capability of humans. Therefore, humans will only be able to weakly supervise superhuman models. This expected deficiency of human evaluation would weaken the safety of future AI systems. Scalable oversight and weak-to-strong generalization are two complementary approaches to tackle this issue. In this paper, we attempt to combine the strengths of these two approaches to further improve alignment. Specifically, we investigate ways of improving human supervision with a strong pretrained model and then supervise the strong model with enhanced weak human supervision. To make iterative empirical progress, we consider an analogy: can we use a strong model to improve weak model supervision and then use it to supervise the strong model? We empirically test it by finetuning a small weak model on ground truth labels with the additional help from a large strong model, and then finetuning the strong model on labels generated by the weak model. We find that debate can assist a weak model in extracting trustworthy information from an untrustworthy strong model, which provides leverage as context on samples when training a weak model. We also show that an ensemble of weak models helps exploit long arguments generated by strong model debaters and obtain a more robust supervision estimate. Extensive experiments on the OpenAI weak-to-strong NLP benchmarks show that the combination approach leads to better alignment, which indicates that debate has the potential to help weak-to-strong generalization.
中文: 本研究通过利用强模型增强弱人类监督,再以改进后的监督对齐更强模型,结合可扩展监督与弱到强泛化的方法,实验证明辩论机制和集成策略能有效提升人工智能对齐效果。
English: This research explores combining scalable oversight with weak-to-strong generalization by using strong models to enhance weak human supervision and then applying this improved supervision to align stronger models, demonstrating through debate mechanisms and ensemble methods that this approach significantly advances AI alignment.

Authors:Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Xiaobo Xia, Hamid Alinejad-Rokny, Fei Huang
Title: OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Abstract:
Recent advancements in omnimodal learning have significantly improved understanding and generation across images, text, and speech, yet these developments remain predominantly confined to proprietary models. The lack of high-quality omnimodal datasets and the challenges of real-time emotional speech synthesis have notably hindered progress in open-source research. To address these limitations, we introduce OpenOmni, a two-stage training framework that integrates omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model undergoes further training on text-image tasks, enabling (near) zero-shot generalization from vision to speech, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder is trained on speech tasks with direct preference optimization, enabling real-time emotional speech synthesis with high fidelity. Experiments show that OpenOmni surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. It achieves a 4-point absolute improvement on OmniBench over the leading open-source model VITA, despite using 5x fewer training samples and a smaller model size (7B vs. 7x8B). Additionally, OpenOmni achieves real-time speech generation with <1s latency in non-autoregressive mode, reducing inference time by 5x compared to autoregressive methods, and improves emotion classification accuracy by 7.7\%.
中文摘要:提出的 OpenOmni 框架通过两阶段训练方法克服了开源全模态学习的局限,在多个基准测试中实现最优性能,同时以显著降低的延迟和更高效率实现实时情感语音合成。
English Summary: The proposed OpenOmni framework overcomes limitations in open-source omnimodal learning by implementing a two-stage training approach that achieves state-of-the-art performance across multiple benchmarks while enabling real-time emotional speech synthesis with significantly reduced latency and improved efficiency.

Authors:Qingyao Ai, Jingtao Zhan, Yiqun Liu
Title: Foundations of GenIR
Abstract:
The chapter discusses the foundational impact of modern generative AI models on information access (IA) systems. In contrast to traditional AI, the large-scale training and superior data modeling of generative AI models enable them to produce high-quality, human-like responses, which brings brand new opportunities for the development of IA paradigms. In this chapter, we identify and introduce two of them in detail, i.e., information generation and information synthesis. Information generation allows AI to create tailored content addressing user needs directly, enhancing user experience with immediate, relevant outputs. Information synthesis leverages the ability of generative AI to integrate and reorganize existing information, providing grounded responses and mitigating issues like model hallucination, which is particularly valuable in scenarios requiring precision and external knowledge. This chapter delves into the foundational aspects of generative models, including architecture, scaling, and training, and discusses their applications in multi-modal scenarios. Additionally, it examines the retrieval-augmented generation paradigm and other methods for corpus modeling and understanding, demonstrating how generative AI can enhance information access systems. It also summarizes potential challenges and fruitful directions for future studies.
中文摘要:本章探讨了现代生成式AI模型如何通过信息生成与合成技术,以高质量、拟人化的响应革新信息获取系统,同时深入分析了其基础架构、应用场景及未来挑战。
English Summary: This chapter explores how modern generative AI models revolutionize information access systems by enabling high-quality, human-like responses through information generation and synthesis, while also addressing their foundational aspects, applications, and future challenges.

Authors:Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu
Title: Towards the Worst-case Robustness of Large Language Models
Abstract:
Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0\% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and use them to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform kernel, against \textit{any possible attack} with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.
中文: 本研究揭示当前大多数确定性防御对大型语言模型的对抗攻击几乎不具备最坏情况下的鲁棒性,同时提出基于背包求解器的随机防御认证下界,确保能够抵御任何可能的攻击。
English: This study demonstrates that most current deterministic defenses for large language models have nearly zero worst-case robustness against adversarial attacks, while proposing a certified lower bound for stochastic defenses using knapsack solvers to ensure protection against any possible attack.
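
For readers unfamiliar with the solver family the abstract refers to, the following is a minimal sketch of the textbook greedy fractional knapsack algorithm. It is a generic illustration only: it is not the paper's certification procedure, and the function name and example items are hypothetical.

```python
from typing import List, Tuple


def fractional_knapsack(items: List[Tuple[float, float]], capacity: float) -> float:
    """Return the maximum total value for the fractional knapsack problem.

    `items` holds (value, weight) pairs with weight > 0. Items are taken
    greedily in order of value per unit weight, and the last item taken
    may be split.
    """
    total = 0.0
    remaining = capacity
    by_density = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    for value, weight in by_density:
        if remaining <= 0:
            break
        take = min(weight, remaining)  # whole item, or the fraction that fits
        total += value * (take / weight)
        remaining -= take
    return total


# Classic textbook instance with (value, weight) pairs.
best = fractional_knapsack([(60, 10), (100, 20), (120, 30)], capacity=50)
print(best)
```

The greedy rule is optimal for the fractional variant; the 0-1 variant mentioned in the abstract does not allow splitting items and generally needs dynamic programming or a similar exact method.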

Authors:Zhang Liu, Dusit Niyato, Jiacheng Wang, Geng Sun, Lianfen Huang, Zhibin Gao, Xianbin Wang
Title: Generative AI for Lyapunov Optimization Theory in UAV-based Low-Altitude Economy Networking
Abstract:
Lyapunov optimization theory has recently emerged as a powerful mathematical framework for solving complex stochastic optimization problems by transforming long-term objectives into a sequence of real-time short-term decisions while ensuring system stability. This theory is particularly valuable in unmanned aerial vehicle (UAV)-based low-altitude economy (LAE) networking scenarios, where it could effectively address inherent challenges of dynamic network conditions, multiple optimization objectives, and stability requirements. Recently, generative artificial intelligence (GenAI) has garnered significant attention for its unprecedented capability to generate diverse digital content. Extending beyond content generation, in this paper, we propose a framework integrating generative diffusion models with reinforcement learning to address Lyapunov optimization problems in UAV-based LAE networking. We begin by introducing the fundamentals of Lyapunov optimization theory and analyzing the limitations of both conventional methods and traditional AI-enabled approaches. We then examine various GenAI models and comprehensively analyze their potential contributions to Lyapunov optimization. Subsequently, we develop a Lyapunov-guided generative diffusion model-based reinforcement learning framework and validate its effectiveness through a UAV-based LAE networking case study. Finally, we outline several directions for future research.
Chinese: 本文提出了一种将生成扩散模型与强化学习相结合的新框架,用于解决无人机低空经济网络中的Lyapunov优化问题,在应对动态网络挑战的同时确保系统稳定性。
English: This paper proposes a novel framework that integrates generative diffusion models with reinforcement learning to solve Lyapunov optimization problems in UAV-based low-altitude economy networking, addressing dynamic network challenges while ensuring system stability.
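
The idea of turning a long-term objective into per-slot decisions can be illustrated with the standard drift-plus-penalty rule for a single queue. This is a minimal sketch of the classical technique, not the paper's diffusion-based framework; the action tuples, the trade-off weight `v`, and all numbers are hypothetical.

```python
from typing import List, Tuple

# Each action is (name, penalty, service): the cost paid in this slot and the
# amount of backlog the action removes. All values here are illustrative.
Action = Tuple[str, float, float]


def update_queue(backlog: float, arrival: float, service: float) -> float:
    """Standard queue dynamics: Q(t+1) = max(Q(t) - b(t), 0) + a(t)."""
    return max(backlog - service, 0.0) + arrival


def drift_plus_penalty_action(backlog: float, actions: List[Action], v: float) -> Action:
    """Pick the action minimizing v * penalty - backlog * service.

    Arrivals are assumed not to depend on the action, so they drop out of
    the per-slot minimization.
    """
    return min(actions, key=lambda a: v * a[1] - backlog * a[2])


actions = [("idle", 0.0, 0.0), ("transmit", 5.0, 2.0)]
print(drift_plus_penalty_action(10.0, actions, v=1.0)[0])  # large backlog
print(drift_plus_penalty_action(1.0, actions, v=1.0)[0])   # small backlog
```

A larger `v` puts more weight on the penalty (for example energy) at the price of a larger average backlog, which is the usual trade-off in this family of methods.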

Authors:Jiawei Huang, Aimin Wang, Geng Sun, Jiahui Li, Jiacheng Wang, Dusit Niyato, Victor C. M. Leung
Title: Low-altitude Friendly-Jamming for Satellite-Maritime Communications via Generative AI-enabled Deep Reinforcement Learning
Abstract:
Low Earth Orbit (LEO) satellites can be used to assist maritime wireless communications for data transmission across wide-ranging areas. However, the extensive coverage of LEO satellites, combined with the openness of channels, can expose the communication process to security risks. This paper presents a low-altitude friendly-jamming LEO satellite-maritime communication system enabled by an unmanned aerial vehicle (UAV) to ensure data security at the physical layer. Since such a system requires trade-off policies that balance the secrecy rate and energy consumption of the UAV to meet evolving scenario demands, we formulate a secure satellite-maritime communication multi-objective optimization problem (SSMCMOP). In order to solve the dynamic and long-term optimization problem, we reformulate it into a Markov decision process. We then propose a transformer-enhanced soft actor critic (TransSAC) algorithm, a generative artificial intelligence-enabled deep reinforcement learning approach, to solve the reformulated problem, thereby capturing global dependencies and diversely exploring weights. Simulation results demonstrate that TransSAC outperforms various baselines and achieves an optimal secrecy rate while effectively minimizing the energy consumption of the UAV. Moreover, the results identify more suitable constraint values for the system.
中文摘要:本文提出了一种无人机辅助的低空友好干扰低轨卫星-海事通信系统以增强数据安全,采用新型TransSAC深度强化学习算法,在最小化无人机能耗的同时实现了最优保密速率。
English Summary: This paper proposes a UAV-assisted low-altitude friendly-jamming LEO satellite-maritime communication system to enhance data security, employing a novel TransSAC deep reinforcement learning algorithm that effectively optimizes secrecy rates while minimizing UAV energy consumption.

Authors:Jianfei Sun, Qiang Gao, Cong Wu, Yuxian Li, Jiacheng Wang, Dusit Niyato
Title: Secure Resource Allocation via Constrained Deep Reinforcement Learning
Abstract:
The proliferation of Internet of Things (IoT) devices and the advent of 6G technologies have introduced computationally intensive tasks that often surpass the processing capabilities of user devices. Efficient and secure resource allocation in serverless multi-cloud edge computing environments is essential for supporting these demands and advancing distributed computing. However, existing solutions frequently struggle with the complexity of multi-cloud infrastructures, robust security integration, and effective application of traditional deep reinforcement learning (DRL) techniques under system constraints. To address these challenges, we present SARMTO, a novel framework that integrates an action-constrained DRL model. SARMTO dynamically balances resource allocation, task offloading, security, and performance by utilizing a Markov decision process formulation, an adaptive security mechanism, and sophisticated optimization techniques. Extensive simulations across varying scenarios, including different task loads, data sizes, and MEC capacities, show that SARMTO consistently outperforms five baseline approaches, achieving up to a 40% reduction in system costs and a 41.5% improvement in energy efficiency over state-of-the-art methods. These enhancements highlight SARMTO's potential to revolutionize resource management in intricate distributed computing environments, opening the door to more efficient and secure IoT and edge computing applications.
中文摘要:SARMTO框架采用动作约束深度强化学习优化多云边缘计算中的资源分配与安全机制,在物联网和6G应用中实现了系统成本大幅降低与能效显著提升。
English Summary: The SARMTO framework utilizes action-constrained deep reinforcement learning to optimize resource allocation and security in multi-cloud edge computing, demonstrating significant improvements in cost reduction and energy efficiency for IoT and 6G applications.

Authors:Yangning Li, Hui Kang, Jiahui Li, Geng Sun, Zemin Sun, Jiacheng Wang, Changyuan Zhao, Dusit Niyato
Title: A Correlated Data-Driven Collaborative Beamforming Approach for Energy-efficient IoT Data Transmission
Abstract:
The expansion of the Internet of Things (IoT) has led to significant challenges in wireless data harvesting, dissemination, and energy management due to the massive volumes of data generated by IoT devices. These challenges are exacerbated by data redundancy arising from spatial and temporal correlations. To address these issues, this paper proposes a novel data-driven collaborative beamforming (CB)-based communication framework for IoT networks. Specifically, the framework integrates CB with an overlap-based multi-hop routing protocol (OMRP) to enhance data transmission efficiency while mitigating energy consumption and addressing hot spot issues in remotely deployed IoT networks. Based on the data aggregation to a specific node by OMRP, we formulate a node selection problem for the CB stage, with the objective of optimizing uplink transmission energy consumption. Given the complexity of the problem, we introduce a softmax-based proximal policy optimization with long short-term memory (SoftPPO-LSTM) algorithm to intelligently select CB nodes for improving transmission efficiency. Simulation results validate the effectiveness of the proposed OMRP and SoftPPO-LSTM methods, demonstrating significant improvements over existing routing protocols and node selection strategies. The results also reveal that combining OMRP with the SoftPPO-LSTM method effectively mitigates hot spot problems and offers superior performance compared to traditional strategies.
中文: 本文提出一种结合重叠多跳路由协议与协作波束成形的通信框架,通过SoftPPO-LSTM算法智能选择节点,有效提升物联网数据传输效率、降低能耗并缓解热点问题。
English: This paper proposes a collaborative beamforming framework integrated with an overlap-based multi-hop routing protocol to enhance IoT data transmission efficiency, reduce energy consumption, and address hot spot issues, utilizing a novel SoftPPO-LSTM algorithm for intelligent node selection.

Authors:Geng Sun, Weilong Ma, Jiahui Li, Zemin Sun, Jiacheng Wang, Dusit Niyato, Shiwen Mao
Title: Task Delay and Energy Consumption Minimization for Low-altitude MEC via Evolutionary Multi-objective Deep Reinforcement Learning
Abstract:
The low-altitude economy (LAE), driven by unmanned aerial vehicles (UAVs) and other aircraft, has revolutionized fields such as transportation, agriculture, and environmental monitoring. In the upcoming sixth-generation (6G) era, UAV-assisted mobile edge computing (MEC) is particularly crucial in challenging environments such as mountainous or disaster-stricken areas. The computation task offloading problem is one of the key issues in UAV-assisted MEC, primarily addressing the trade-off between minimizing the task delay and the energy consumption of the UAV. In this paper, we consider a UAV-assisted MEC system where the UAV carries the edge servers to facilitate task offloading for ground devices (GDs), and formulate a calculation delay and energy consumption multi-objective optimization problem (CDECMOP) to simultaneously improve the performance and reduce the cost of the system. Then, by modeling the formulated problem as a multi-objective Markov decision process (MOMDP), we propose a multi-objective deep reinforcement learning (DRL) algorithm within an evolutionary framework to dynamically adjust the weights and obtain non-dominated policies. Moreover, to ensure stable convergence and improve performance, we incorporate a target distribution learning (TDL) algorithm. Simulation results demonstrate that the proposed algorithm can better balance multiple optimization objectives and obtain superior non-dominated solutions compared to other methods.
中文摘要:本文提出了一种进化框架下的多目标深度强化学习算法,用于优化无人机辅助移动边缘计算系统中的任务延迟与能耗,仿真结果表明该方法能更好地平衡多个优化目标并获得更优的非支配解。
English Summary: The paper proposes a multi-objective deep reinforcement learning algorithm within an evolutionary framework to optimize task delay and energy consumption in UAV-assisted mobile edge computing systems, demonstrating superior performance through simulations.
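
The notion of non-dominated solutions used in the abstract can be made concrete with a small Pareto filter over (delay, energy) pairs, where both objectives are minimized. This is a generic sketch rather than the paper's evolutionary algorithm; the function name and the sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and better in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (task delay, UAV energy) outcomes of four policies.
candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
print(non_dominated(candidates))
```

Here the policy with outcome (3.0, 3.0) is discarded because (2.0, 2.0) is better in both objectives, while the remaining three represent different delay and energy trade-offs.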

Authors:Xiaoya Zheng, Geng Sun, Jiahui Li, Jiacheng Wang, Qingqing Wu, Dusit Niyato, Abbas Jamalipour
Title: UAV Swarm-enabled Collaborative Post-disaster Communications in Low Altitude Economy via a Two-stage Optimization Approach
Abstract:
The low-altitude economy (LAE) plays an indispensable role in cargo transportation, healthcare, infrastructure inspection, and especially post-disaster communication. Specifically, unmanned aerial vehicles (UAVs), as one of the core technologies of the LAE, can be deployed to provide communication coverage, facilitate data collection, and relay data for trapped users, thereby significantly enhancing the efficiency of post-disaster response efforts. In this paper, we design an efficient and robust UAV-swarm enabled collaborative self-organizing network to facilitate post-disaster communications. Specifically, a ground device transmits data to UAV swarms, which then use the collaborative beamforming (CB) technique to form virtual antenna arrays and relay the data to a remote access point (AP) efficiently. We then formulate a rescue-oriented post-disaster transmission rate maximization optimization problem (RPTRMOP) and propose a two-stage optimization approach to address it. In the first stage, the optimal traffic routing and the theoretical upper bound on the transmission rate of the network are derived. In the second stage, we transform the formulated RPTRMOP into a variant named V-RPTRMOP, and a diffusion model-enabled particle swarm optimization (DM-PSO) algorithm is proposed to deal with the V-RPTRMOP. Simulation results show the effectiveness of the proposed two-stage optimization approach in improving the transmission rate of the constructed network, which demonstrates its great potential for post-disaster communications. Moreover, the robustness of the constructed network is also validated by evaluating the impact of two unexpected situations on the system transmission rate.
中文: 本文设计了一种无人机群协同自组织网络,通过两阶段优化方法最大化灾后传输速率,显著提升了通信效率和网络鲁棒性。
English: This paper proposes a UAV-swarm enabled collaborative self-organizing network using a two-stage optimization approach to maximize post-disaster transmission rates, demonstrating significant improvements in communication efficiency and robustness.

Authors:Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
Title: Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
Abstract:
Retrieval-augmented generation (RAG) is widely utilized to incorporate external knowledge into large language models, thereby enhancing factuality and reducing hallucinations in question-answering (QA) tasks. A standard RAG pipeline consists of several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine-tuning, which can lead to misalignments between the objectives of individual components and the overarching aim of generating accurate answers. Although recent efforts have explored using reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on simple pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these limitations, we propose treating the complex RAG pipeline with multiple components as a multi-agent cooperative task, in which each component can be regarded as an RL agent. Specifically, we present MMOA-RAG, Multi-Module joint Optimization Algorithm for RAG, which employs multi-agent reinforcement learning to harmonize all agents' goals toward a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA benchmarks demonstrate that MMOA-RAG effectively boosts the overall performance of the pipeline and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and demonstrate that MMOA-RAG can be adapted to different RAG pipelines and benchmarks.
Chinese: 本文提出MMOA-RAG,通过多智能体强化学习将检索增强生成流程中的各个组件作为智能体进行联合优化,使所有模块目标统一于最终答案的F1分数等奖励指标,在多项问答任务中超越了现有基线方法。
English: The paper introduces MMOA-RAG, a multi-agent reinforcement learning approach that optimizes the entire retrieval-augmented generation pipeline by aligning all components toward a unified reward, improving performance on question-answering tasks over existing methods.
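
The abstract names the F1 score of the final answer as an example of a shared reward. A minimal sketch of one common way to compute such a score, token-level F1 between a predicted and a reference answer, is given below. The whitespace tokenization and lowercasing are assumptions made for illustration; the abstract does not specify how the paper computes F1.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answers, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers: one of two predicted tokens matches the reference.
reward = token_f1("Paris France", "Paris")
print(round(reward, 3))
```

In a multi-agent setup of the kind the abstract describes, a single scalar like this would be handed to every agent in the pipeline as the shared reward signal.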

Authors:Dongsheng Zhu, Weixian Shi, Zhengliang Shi, Zhaochun Ren, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
Title: Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation
Abstract:
Although current Large Language Models (LLMs) exhibit impressive capabilities, performing complex real-world tasks still requires tool learning. Mainstream methods, such as CoT/ReAct, rely on step-by-step tool invocation to interact with external environments, but they are limited in perceptual scope and lack adequate task-planning capability. To address these limitations, other studies introduce the Depth-First Search-based Decision Tree (DFSDT), which still suffers from high computational cost. In this paper, we introduce a novel parallel tool invocation paradigm, DTA-Llama (Divide-Then-Aggregate Llama). First, we transform traditional tree-based tool search paths into a Directed Acyclic Graph (DAG) structure, generating a high-quality parallel tool invocation dataset. DTA-Llama is then trained on the dataset to learn to iteratively divide the current task into several parallel tool invocation sub-tasks and aggregate the invocation results to decide the next actions. Furthermore, we introduce an efficient inference framework inspired by the Process/Threads mechanism when applying DTA-Llama to practical tasks. Experimental results show that our approach substantially enhances task performance while reducing token consumption and inference time. Llama2-7B, using our method, is comparable to the official parallel function calling method of GPT-3.5. The relevant code, dataset, and model weights are available at https://corn0205.github.io/
中文摘要:本文提出DTA-Llama这一新型并行工具调用范式,将传统树状工具搜索路径转化为有向无环图结构,通过迭代分解任务与聚合结果实现更高效的工具学习,在提升任务性能的同时显著降低了计算开销。
English Summary: This paper introduces DTA-Llama, a novel parallel tool invocation paradigm that transforms traditional tree-based tool search into a DAG structure, enabling more efficient task division and result aggregation while significantly improving performance and reducing computational costs.
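
To illustrate why a DAG of tool calls permits parallel invocation, the sketch below groups nodes of a dependency graph into levels whose members have no unmet dependencies and could therefore run concurrently. This is a generic level-by-level topological sort, not the DTA-Llama training or inference procedure; the tool names are hypothetical.

```python
from typing import Dict, List, Set


def parallel_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group DAG nodes into levels that can be executed in parallel.

    `deps` maps every node to the set of nodes it depends on. Every node,
    including those without dependencies, must appear as a key.
    """
    remaining = {node: set(d) for node, d in deps.items()}
    levels: List[List[str]] = []
    while remaining:
        # Nodes with no unmet dependencies are ready to run together.
        ready = sorted(n for n, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        levels.append(ready)
        for n in ready:
            del remaining[n]
        for d in remaining.values():
            d.difference_update(ready)
    return levels


# Hypothetical tool calls: three independent lookups feed one aggregation step.
plan = {
    "search_flights": set(),
    "search_hotels": set(),
    "get_weather": set(),
    "plan_trip": {"search_flights", "search_hotels", "get_weather"},
}
print(parallel_levels(plan))
```

With a tree-style sequential search the three lookups would be issued one after another; grouping them into a single level is what makes the divide step parallel and the final node the aggregate step.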

Authors:Shijie Wang, Wenqi Fan, Yue Feng, Shanru Lin, Xinyu Ma, Shuaiqiang Wang, Dawei Yin
Title: Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation
Abstract:
Recommender systems have become increasingly vital in our daily lives, helping to alleviate the problem of information overload across various user-oriented online services. The emergence of Large Language Models (LLMs) has yielded remarkable achievements, demonstrating their potential for the development of next-generation recommender systems. Despite these advancements, LLM-based recommender systems face inherent limitations stemming from their LLM backbones, particularly issues of hallucinations and the lack of up-to-date and domain-specific knowledge. Recently, Retrieval-Augmented Generation (RAG) has garnered significant attention for addressing these limitations by leveraging external knowledge sources to enhance the understanding and generation of LLMs. However, vanilla RAG methods often introduce noise and neglect structural relationships in knowledge, limiting their effectiveness in LLM-based recommendations. To address these limitations, we propose to retrieve high-quality and up-to-date structure information from the knowledge graph (KG) to augment recommendations. Specifically, our approach develops a retrieval-augmented framework, termed K-RagRec, that facilitates the recommendation generation process by incorporating structure information from the external KG. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed method.
中文:提出的K-RagRec框架通过整合知识图谱中的结构化信息,有效解决了基于大语言模型的推荐系统中存在的幻觉和知识陈旧问题,大量实验验证了该方法的有效性。
English: The proposed K-RagRec framework enhances LLM-based recommender systems by integrating structured knowledge from knowledge graphs to address limitations like hallucinations and outdated information, with extensive experiments demonstrating its effectiveness.

Authors:Fenglin Yu, Fangkai Yang, Xiaoting Qin, Zhiyang Zhang, Jue Zhang, Qingwei Lin, Hongyu Zhang, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Title: Enabling Autonomic Microservice Management through Self-Learning Agents
Abstract:
The increasing complexity of modern software systems necessitates robust autonomic self-management capabilities. While Large Language Models (LLMs) demonstrate potential in this domain, they often face challenges in adapting their general knowledge to specific service contexts. To address this limitation, we propose ServiceOdyssey, a self-learning agent system that autonomously manages microservices without requiring prior knowledge of service-specific configurations. By leveraging curriculum learning principles and iterative exploration, ServiceOdyssey progressively develops a deep understanding of operational environments, reducing dependence on human input or static documentation. A prototype built with the Sock Shop microservice demonstrates the potential of this approach for autonomic microservice management.
中文: ServiceOdyssey是一种自学习代理系统,通过课程学习和迭代探索自主管理微服务,逐步适应特定操作环境,减少对人类输入的依赖。
English: ServiceOdyssey is a self-learning agent system that autonomously manages microservices by using curriculum learning and iterative exploration to adapt to specific operational environments, reducing reliance on human input.

Authors:Minghua He, Fangkai Yang, Pu Zhao, Wenjie Yin, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Title: ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation
Abstract:
Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed code executability and unreliable automated code translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation, aimed at utilizing executability representations such as functional semantics, syntax structures, and variable dependencies to enhance the capabilities of LLMs in code translation. To evaluate the effectiveness of ExeCoder, we manually enhanced the widely used benchmark TransCoder-test, resulting in a benchmark called TransCoder-test-X that serves LLMs. Evaluation of TransCoder-test-X indicates that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by over 10.88% to 38.78% and over 27.44% to 42.97% on two metrics, and even outperforms the renowned closed-source LLM GPT-4o. Website: https://execoder4trans.github.io/
Chinese: 研究人员开发了ExeCoder,这是一种专门用于代码翻译的大语言模型,通过整合功能语义、语法结构等可执行性信息来提升翻译能力,在性能上超越了GPT-4o等模型,达到了最先进水平。
English: Researchers have developed ExeCoder, a large language model that incorporates executability information like functional semantics and syntax to improve code translation, achieving state-of-the-art results and surpassing models like GPT-4o in performance.

Authors:Minghua He, Yue Chen, Fangkai Yang, Pu Zhao, Wenjie Yin, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation
Abstract:
Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only learn the contextual semantics of code during pre-training, neglecting executability information closely related to the execution state of the code, which results in unguaranteed code executability and unreliable automated code translation. To address this issue, we propose ExeCoder, an LLM specifically designed for code translation, aimed at utilizing executability representations such as functional semantics, syntax structures, and variable dependencies to enhance the capabilities of LLMs in code translation. To evaluate the effectiveness of ExeCoder, we manually enhanced the widely used benchmark TransCoder-test, resulting in a benchmark called TransCoder-test-X that serves LLMs. Evaluation of TransCoder-test-X indicates that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by over 10.88% to 38.78% and over 27.44% to 42.97% on two metrics, and even outperforms the renowned closed-source LLM GPT-4o. Code is available at https://aka.ms/execoder
Chinese: 研究人员开发了ExeCoder,这是一种专门用于代码翻译的大语言模型,通过整合功能语义、语法结构等可执行性信息来提升翻译能力,在性能上超越了GPT-4o等模型,达到了最先进水平。
English: Researchers have developed ExeCoder, a large language model that incorporates executability information like functional semantics and syntax to improve code translation, achieving state-of-the-art results and surpassing models like GPT-4o in performance.

Authors:Xing Zhang, Jiaheng Wen, Fangkai Yang, Pu Zhao, Yu Kang, Junhao Wang, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Title: Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation
Abstract:
The advancement of large language models has intensified the need to modernize enterprise applications and migrate legacy systems to secure, versatile languages. However, existing code translation benchmarks primarily focus on individual functions, overlooking the complexities involved in translating entire repositories, such as maintaining inter-module coherence and managing dependencies. While some recent repository-level translation benchmarks attempt to address these challenges, they still face limitations, including poor maintainability and overly coarse evaluation granularity, which make them less developer-friendly. We introduce Skeleton-Guided-Translation, a framework for repository-level Java to C# code translation with fine-grained quality evaluation. It uses a two-step process: first translating the repository's structural "skeletons", then translating the full repository guided by these skeletons. Building on this, we present TRANSREPO-BENCH, a benchmark of high quality open-source Java repositories and their corresponding C# skeletons, including matching unit tests and build configurations. Our unit tests are fixed and can be applied across multiple or incremental translations without manual adjustments, enhancing automation and scalability in evaluations. Additionally, we develop fine-grained evaluation metrics that assess translation quality at the individual test case level, addressing traditional binary metrics' inability to distinguish when build failures cause all tests to fail. Evaluations using TRANSREPO-BENCH highlight key challenges and advance more accurate repository level code translation.
中文: 大语言模型推动企业现代化,但现有代码翻译基准无法解决仓库级复杂性问题;我们的骨架引导翻译框架通过结构骨架映射和细粒度评估克服了这些限制。
English: Large language models are driving enterprise modernization, but current code translation benchmarks fail to address repository-level complexities like dependency management and module coherence; our Skeleton-Guided-Translation framework overcomes these limitations through structural skeleton mapping and fine-grained evaluation with TRANSREPO-BENCH.
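To make the contrast between binary and test-case-level evaluation concrete, here is a minimal sketch. The two scoring functions and the test names are hypothetical and are not the metric definitions used by TRANSREPO-BENCH; they only show why an all-or-nothing score cannot separate a nearly correct translation from a completely broken one.

```python
def binary_score(results):
    """All-or-nothing score: 1.0 only if every unit test passes."""
    return 1.0 if results and all(results.values()) else 0.0


def per_test_score(results):
    """Fine-grained score: fraction of individual unit tests that pass."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Hypothetical outcomes for two translations of the same repository.
translation_a = {"test_add": True, "test_remove": True, "test_list": False}
translation_b = {"test_add": False, "test_remove": False, "test_list": False}

for name, results in [("A", translation_a), ("B", translation_b)]:
    print(name, binary_score(results), round(per_test_score(results), 3))
```

Both translations receive a binary score of 0.0, while the per-test score distinguishes translation A (two of three tests pass) from translation B (none pass).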

Authors:Linghao Zhang, Junhao Wang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Jiaheng Wen, Chengxing Xie, Maoquan Wang, Yufan Huang, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Title: DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Abstract:
Large Language Models have advanced automated software development; however, correctly inferring dependencies, namely identifying the internal components and external packages required for a repository to run successfully, remains a challenge. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs' capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 42.9% execution pass rate, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
中文: DI-BENCH 是一个专为评估大语言模型依赖推断能力而设计的大规模基准测试框架,覆盖多种编程语言,实验显示当前最佳模型的执行通过率仅为42.9%,表明自动化软件开发仍有巨大提升空间。
English: DI-BENCH is a large-scale benchmark designed to evaluate LLMs' dependency inference capabilities across multiple programming languages, revealing that even the current best-performing model achieves only a 42.9% execution pass rate and highlighting substantial room for improvement in automated software development.

Authors:Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Title: Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
Abstract:
Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection speed and 16.63% higher success rates compared to existing methods.
Chinese: Auto-RT是一种强化学习框架,通过高效探索和优化复杂攻击策略,显著提升了大型语言模型的自动化红队测试能力,实现了更快的漏洞检测速度和16.63%的成功率提升。
English: Auto-RT is a reinforcement learning framework that enhances automated red-teaming by efficiently exploring and optimizing complex attack strategies, achieving faster vulnerability detection and 16.63% higher success rates in large language models.

Authors:Qiang Liu, Xinlong Chen, Yue Ding, Bowen Song, Weiqiang Wang, Shu Wu, Liang Wang
Title: Attention-guided Self-reflection for Zero-shot Hallucination Detection in Large Language Models
Abstract:
Hallucination has emerged as a significant barrier to the effective application of Large Language Models (LLMs). In this work, we introduce a novel Attention-Guided SElf-Reflection (AGSER) approach for zero-shot hallucination detection in LLMs. The AGSER method utilizes attention contributions to categorize the input query into attentive and non-attentive queries. Each query is then processed separately through the LLMs, allowing us to compute consistency scores between the generated responses and the original answer. The difference between the two consistency scores serves as a hallucination estimator. In addition to its efficacy in detecting hallucinations, AGSER notably reduces computational overhead, requiring only three passes through the LLM and utilizing two sets of tokens. We have conducted extensive experiments with four widely-used LLMs across three different hallucination benchmarks, demonstrating that our approach significantly outperforms existing methods in zero-shot hallucination detection.
中文: AGSER方法通过基于注意力的查询分类和响应一致性分析,有效检测大语言模型中的幻觉问题,在显著优于现有方法的同时降低了计算开销。
English: The AGSER method effectively detects hallucinations in LLMs by analyzing attention-based query categorization and response consistency, significantly outperforming existing approaches while reducing computational costs.
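The pipeline described in the abstract (split the query by attention, answer each part, compare consistency with the original answer) can be sketched as follows. This is a toy illustration under stated assumptions, not the AGSER implementation: the `keep_ratio` threshold, the Jaccard word-overlap consistency measure, the sign convention (attentive minus non-attentive), and all function names are hypothetical, and the LLM calls are replaced by answers passed in as strings.

```python
def split_query(tokens, attention, keep_ratio=0.5):
    """Split tokens into attentive and non-attentive sub-queries by attention score."""
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k tokens with the highest attention contribution.
    ranked = sorted(range(len(tokens)), key=lambda i: attention[i], reverse=True)
    top = set(ranked[:k])
    attentive = [t for i, t in enumerate(tokens) if i in top]
    non_attentive = [t for i, t in enumerate(tokens) if i not in top]
    return attentive, non_attentive


def consistency(answer_a, answer_b):
    """Toy consistency score: Jaccard overlap of lowercase word sets (an assumption)."""
    a, b = set(answer_a.lower().split()), set(answer_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def hallucination_estimate(original, attentive_answer, non_attentive_answer):
    """Difference between the two consistency scores (sign convention assumed)."""
    return consistency(attentive_answer, original) - consistency(non_attentive_answer, original)


# Hypothetical query, attention contributions, and model answers.
tokens = ["who", "wrote", "hamlet", "?"]
attention = [0.1, 0.3, 0.5, 0.05]
attentive, non_attentive = split_query(tokens, attention)
estimate = hallucination_estimate(
    "William Shakespeare", "william shakespeare", "Christopher Marlowe"
)
print(attentive, non_attentive, estimate)
```

In a real setting the three answers would come from three passes through the LLM (the original query plus the two sub-queries), which matches the pass count mentioned in the abstract.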

Authors:Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li
Title: Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Abstract:
Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers have moved beyond simple autoregressive token generation by introducing the concept of "thought" -- a sequence of tokens representing intermediate steps in the reasoning process. This innovative paradigm enables LLMs to mimic complex human reasoning processes, such as tree search and reflective thinking. Recently, an emerging trend of learning to reason has applied reinforcement learning (RL) to train LLMs to master reasoning processes. This approach enables the automatic generation of high-quality reasoning trajectories through trial-and-error search algorithms, significantly expanding LLMs' reasoning capacity by providing substantially more training data. Furthermore, recent studies demonstrate that encouraging LLMs to "think" with more tokens during test-time inference can further significantly boost reasoning accuracy. Therefore, train-time and test-time scaling combine to reveal a new research frontier -- a path toward Large Reasoning Models. The introduction of OpenAI's o1 series marks a significant milestone in this research direction. In this survey, we present a comprehensive review of recent progress in LLM reasoning. We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling. We also analyze popular open-source projects for building large reasoning models, and conclude with open challenges and future research directions.
中文: 大型语言模型通过引入“思维”过程和强化学习,超越了简单的自回归标记生成,显著提升了推理能力,标志着向大型推理模型发展的重要进展。
English: Large Language Models are advancing beyond simple token generation by incorporating "thought" processes and reinforcement learning to enhance reasoning capabilities, marking a significant step toward developing Large Reasoning Models.

Authors:Huandong Wang, Wenjie Fu, Yingzhou Tang, Zhilong Chen, Yuxi Huang, Jinghua Piao, Chen Gao, Fengli Xu, Tao Jiang, Yong Li
Title: A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy
Abstract:
While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and for unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collecting and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances for enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.
中文: 本综述提出了一个统一框架,涵盖大语言模型在隐私保护、幻觉减少、价值对齐及安全防御等方面的最新进展,旨在全面提升其实际应用能力。
English: This survey provides a unified framework addressing privacy risks, hallucinations, value misalignment, and security vulnerabilities in large language models across all development stages to enhance their real-world applicability.

Authors:Jinghua Piao, Zhihong Lu, Chen Gao, Fengli Xu, Qinghua Hu, Fernando P. Santos, Yong Li, James Evans
Title: Emergence of human-like polarization among large language model agents
Abstract:
Rapid advances in large language models (LLMs) have not only empowered autonomous agents to generate social networks, communicate, and form shared and diverging opinions on political issues, but have also begun to play a growing role in shaping human political deliberation. Our understanding of their collective behaviours and underlying mechanisms remains incomplete, however, posing unexpected risks to human society. In this paper, we simulate a networked system involving thousands of large language model agents, discovering that their social interactions, guided through LLM conversation, result in human-like polarization. We discover that these agents not only spontaneously develop their own social network with human-like properties, including homophilic clustering, but also shape their collective opinions through mechanisms observed in the real world, including the echo chamber effect. Similarities between humans and LLM agents -- encompassing behaviours, mechanisms, and emergent phenomena -- raise concerns about their capacity to amplify societal polarization, but also hold the potential to serve as a valuable testbed for identifying plausible strategies to mitigate polarization and its consequences.
中文:大语言模型代理能自主形成社交网络并通过回音室等机制产生类人类极化现象,既可能加剧社会分裂风险,也可作为研究消解极化策略的重要实验平台。
English: Large language model agents can autonomously form social networks and exhibit human-like polarization through mechanisms such as echo chambers, posing risks of amplifying societal divides while offering a testbed for mitigation strategies.

Authors:Ye Fang, Zeyi Sun, Shangzhan Zhang, Tong Wu, Yinghao Xu, Pan Zhang, Jiaqi Wang, Gordon Wetzstein, Dahua Lin
Title: RelightVid: Temporal-Consistent Diffusion Model for Video Relighting
Abstract:
Diffusion models have demonstrated remarkable success in image generation and editing, with recent advancements enabling albedo-preserving image relighting. However, applying these models to video relighting remains challenging due to the lack of paired video relighting datasets and the high demands for output fidelity and temporal consistency, further complicated by the inherent randomness of diffusion models. To address these challenges, we introduce RelightVid, a flexible framework for video relighting that can accept background video, text prompts, or environment maps as relighting conditions. Trained on in-the-wild videos with carefully designed illumination augmentations and rendered videos under extreme dynamic lighting, RelightVid achieves arbitrary video relighting with high temporal consistency without intrinsic decomposition while preserving the illumination priors of its image backbone.
中文总结:RelightVid是一种新颖的视频重照明框架,能够利用多种输入条件实现高保真度的视频光照调整,同时保持时间一致性并继承图像模型的照明先验知识。
English Summary: RelightVid is a novel framework that enables high-fidelity video relighting using various input conditions while maintaining temporal consistency and leveraging illumination priors from image models.

Authors:Qi Ma, Runyi Yang, Bin Ren, Nicu Sebe, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel
Title: CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation
Abstract:
Localizing textual descriptions within large-scale 3D scenes presents inherent ambiguities, such as identifying all traffic lights in a city. Addressing this, we introduce a method to generate distributions of camera poses conditioned on textual descriptions, facilitating robust reasoning for broadly defined concepts. Our approach employs a diffusion-based architecture to refine noisy 6DoF camera poses towards plausible locations, with conditional signals derived from pre-trained text encoders. Integration with the pretrained Vision-Language Model, CLIP, establishes a strong linkage between text descriptions and pose distributions. Enhancement of localization accuracy is achieved by rendering candidate poses using 3D Gaussian splatting, which corrects misaligned samples through visual reasoning. We validate our method's superiority by comparing it against standard distribution estimation methods across five large-scale datasets, demonstrating consistent outperformance. Code, datasets and more information will be publicly available at our project page.
中文摘要:本文提出一种基于扩散模型的方法,通过文本描述生成相机位姿分布,结合CLIP模型与3D高斯溅射技术提升3D场景中的定位精度,并在多个数据集上验证了其优越性能。
English Summary: This paper introduces a diffusion-based method to generate camera pose distributions from textual descriptions, enhancing localization accuracy in 3D scenes through CLIP integration and 3D Gaussian splatting, and demonstrates superior performance across multiple datasets.

Authors:Yuwei Zhang, Zhi Jin, Ying Xing, Ge Li, Fang Liu, Jiaxin Zhu, Wensheng Dou, Jun Wei
Title: PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing
Abstract:
Bug fixing holds significant importance in software development and maintenance. Recent research has made substantial strides in exploring the potential of large language models (LLMs) for automatically resolving software bugs. However, a noticeable gap in existing approaches lies in the oversight of collaborative facets intrinsic to bug resolution, treating the process as a single-stage endeavor. Moreover, most approaches solely take the buggy code snippet as input for LLMs during the patch generation stage. To mitigate the aforementioned limitations, we introduce a novel stage-wise framework named PATCH. Specifically, we first augment the buggy code snippet with corresponding dependence context and intent information to better guide LLMs in generating the correct candidate patches. Additionally, by taking inspiration from bug management practices, we decompose the bug-fixing task into four distinct stages: bug reporting, bug diagnosis, patch generation, and patch verification. These stages are performed interactively by LLMs, aiming to simulate the collaborative behavior of programmers during the resolution of software bugs. By harnessing these collective contributions, PATCH effectively enhances the bug-fixing capability of LLMs. We implement PATCH by employing the powerful dialogue-based LLM ChatGPT. Our evaluation on the widely used bug-fixing benchmark BFP demonstrates that PATCH has achieved better performance than state-of-the-art LLMs.
中文: PATCH框架通过将依赖上下文和意图信息融入代码片段,并将修复过程分解为四个交互阶段,显著提升了大型语言模型的缺陷修复能力,在BFP基准测试中表现优异。
English: The PATCH framework enhances LLMs' bug-fixing capabilities by incorporating dependency context and intent information into code snippets and decomposing the process into four interactive stages, achieving superior performance on the BFP benchmark.

Authors:Xiaolong Wang, Yuanchi Zhang, Ziyue Wang, Yuzhuang Xu, Fuwen Luo, Yile Wang, Peng Li, Yang Liu
Title: Perspective Transition of Large Language Models for Solving Subjective Tasks
Abstract:
Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks. In contrast to objective tasks such as commonsense reasoning and arithmetic question-answering, the performance of LLMs on subjective tasks is still limited, because the perspective taken on a specific problem plays a crucial role in interpreting the context and giving a proper response. For example, in certain scenarios, LLMs may perform better when answering from an expert role perspective, potentially eliciting their relevant domain knowledge. In contrast, in some scenarios, LLMs may provide more accurate responses when answering from a third-person standpoint, enabling a more comprehensive understanding of the problem and potentially mitigating inherent biases. In this paper, we propose Reasoning through Perspective Transition (RPT), a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives to find the best way to solve the subjective problem at hand. In extensive experiments on 12 subjective tasks using both closed-source and open-source LLMs, including GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used methods based on a single fixed perspective, such as chain-of-thought prompting and expert prompting, highlighting the intricate ways in which LLMs can adapt their perspectives to provide nuanced and contextually appropriate responses for different problems.
Chinese Summary: 提出的视角转换推理方法使大语言模型能够动态选择最优视角来解决主观任务,在多种模型和任务中均优于固定视角方法。
English Summary: The proposed Reasoning through Perspective Transition (RPT) method enables large language models to dynamically select optimal perspectives for solving subjective tasks, outperforming fixed-perspective approaches across multiple models and tasks.

Authors:Yuchun Miao, Sen Zhang, Liang Ding, Yuqi Zhang, Lefei Zhang, Dacheng Tao
Title: The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
Abstract:
This work identifies the Energy Loss Phenomenon in Reinforcement Learning from Human Feedback (RLHF) and its connection to reward hacking. Specifically, energy loss in the final layer of a Large Language Model (LLM) gradually increases during the RL process, with an excessive increase in energy loss characterizing reward hacking. Beyond empirical analysis, we further provide a theoretical foundation by proving that, under mild conditions, the increased energy loss reduces the upper bound of contextual relevance in LLMs, which is a critical aspect of reward hacking as the reduced contextual relevance typically indicates overfitting to reward model-favored patterns in RL. To address this issue, we propose an Energy loss-aware PPO algorithm (EPPO) which penalizes the increase in energy loss in the LLM's final layer during reward calculation to prevent excessive energy loss, thereby mitigating reward hacking. We theoretically show that EPPO can be conceptually interpreted as an entropy-regularized RL algorithm, which provides deeper insights into its effectiveness. Extensive experiments across various LLMs and tasks demonstrate the commonality of the energy loss phenomenon, as well as the effectiveness of EPPO in mitigating reward hacking and improving RLHF performance.
中文: 本研究揭示了强化学习人类反馈中的能量损失现象及其与奖励破解的关联,并提出了一种能量损失感知的PPO算法(EPPO),通过惩罚大语言模型最后一层的能量损失来缓解该问题,从而提升RLHF性能。
English: This study uncovers the Energy Loss Phenomenon in RLHF, linking it to reward hacking and proposing an Energy loss-aware PPO algorithm (EPPO) that mitigates the issue by penalizing energy loss in the LLM's final layer, thereby improving RLHF performance.

Authors:Yiming Liang, Tianyu Zheng, Xinrun Du, Ge Zhang, Jiaheng Liu, Xingwei Qu, Wenqiang Zu, Xingrun Xing, Chujie Zheng, Lei Ma, Guoyin Wang, Zhaoxiang Zhang, Wenhao Huang, Xiang Yue, Jiajun Zhang
Title: Aligning Instruction Tuning with Pre-training
Abstract:
Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose Aligning Instruction Tuning with Pre-training (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.
中文:AITP方法通过将预训练中代表性不足的数据转化为指令-响应对,弥合了指令微调与预训练之间的差距,借助增强的数据多样性和对齐性,持续提升了大型语言模型在多个基准测试中的性能。
English: The AITP method bridges the gap between instruction tuning and pre-training by rewriting underrepresented pre-training data into instruction-response pairs, consistently improving LLM performance across benchmarks through enhanced dataset diversity and alignment.

Authors:Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Abstract:
Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
中文: Mobile-Agent-E是一种分层多智能体框架,通过长期记忆和专门下属代理实现自我进化,解决了现有移动代理的不足,性能比先前最优方法提升了22%。
English: Mobile-Agent-E is a hierarchical multi-agent framework that addresses limitations in current mobile agents by enabling self-evolution through long-term memory and specialized subordinate agents, achieving a 22% absolute improvement over previous state-of-the-art approaches.

Authors:Xinyue Shen, Yixin Wu, Yiting Qu, Michael Backes, Savvas Zannettou, Yang Zhang
Title: HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
Abstract:
Large Language Models (LLMs) have raised increasing concerns about their misuse in generating hate speech. Among all the efforts to address this issue, hate speech detectors play a crucial role. However, the effectiveness of different detectors against LLM-generated hate speech remains largely unknown. In this paper, we propose HateBench, a framework for benchmarking hate speech detectors on LLM-generated hate speech. We first construct a hate speech dataset of 7,838 samples generated by six widely-used LLMs covering 34 identity groups, with meticulous annotations by three labelers. We then assess the effectiveness of eight representative hate speech detectors on the LLM-generated dataset. Our results show that while detectors are generally effective in identifying LLM-generated hate speech, their performance degrades with newer versions of LLMs. We also reveal the potential of LLM-driven hate campaigns, a new threat that LLMs bring to the field of hate speech detection. By leveraging advanced techniques like adversarial attacks and model stealing attacks, the adversary can intentionally evade the detector and automate hate campaigns online. The most potent adversarial attack achieves an attack success rate of 0.966, and its attack efficiency can be further improved by $13-21\times$ through model stealing attacks with acceptable attack performance. We hope our study can serve as a call to action for the research community and platform moderators to fortify defenses against these emerging threats.
中文: 本研究提出HateBench基准框架,发现现有仇恨言论检测器虽能识别大语言模型生成内容,但对新版模型及复杂对抗攻击的检测效果下降,亟需加强防御应对这一新兴威胁。
English: This study introduces HateBench, a benchmarking framework revealing that while current hate speech detectors can identify LLM-generated content, their effectiveness diminishes against newer LLM versions and sophisticated adversarial attacks, highlighting an urgent need for improved defenses.

Authors:Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang
Title: SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
Abstract:
As advancements in large language models (LLMs) continue and the demand for personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) will become essential due to their efficiency in reducing computation costs. However, recent studies have raised alarming concerns that LoRA fine-tuning could potentially compromise the safety alignment in LLMs, posing significant risks for the model owner. In this paper, we first investigate the underlying mechanism by analyzing the changes in safety-alignment-related features before and after fine-tuning. Then, we propose a fixed safety module computed from safety data and a task-specific initialization for trainable parameters in low-rank adaptations, termed Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapter-based approaches across various evaluation metrics in different fine-tuning tasks.
中文:SaLoRA通过引入固定安全模块和任务特定初始化,在参数高效微调中保持大语言模型的安全对齐,并在多项任务中优于其他适配器方法。
English: SaLoRA introduces a fixed safety module and task-specific initialization to maintain safety alignment in LLMs during parameter-efficient fine-tuning, outperforming other adapters across multiple tasks.

Authors:Hanwen Zhang, Ruichen Zhang, Wei Zhang, Dusit Niyato, Yonggang Wen, Chunyan Miao
Title: Advancing Generative Artificial Intelligence and Large Language Models for Demand Side Management with Internet of Electric Vehicles
Abstract:
Generative artificial intelligence, particularly through large language models (LLMs), is poised to transform energy optimization and demand side management (DSM) within microgrids. This paper explores the integration of LLMs into energy management, emphasizing their roles in automating the optimization of DSM strategies with Internet of electric vehicles. We investigate challenges and solutions associated with DSM and explore the new opportunities presented by leveraging LLMs. Then, we propose an innovative solution that enhances LLMs with retrieval-augmented generation for automatic problem formulation, code generation, and customizing optimization. We present a case study to demonstrate the effectiveness of our proposed solution in charging scheduling and optimization for electric vehicles, highlighting our solution's significant advancements in energy efficiency and user adaptability. This work underscores the potential of LLMs for energy optimization and fosters a new era of intelligent DSM solutions.
中文: 本文提出一种结合检索增强的生成式AI解决方案,通过案例研究证明其能自动优化微电网中的电动汽车充电调度,显著提升能源效率和用户适应性。
English: This paper proposes a retrieval-augmented generative AI solution that automates electric vehicle charging optimization in microgrids, demonstrating significant improvements in energy efficiency and user adaptability through a case study.

Authors:Ziyu Zhao, Yixiao Zhou, Zhi Zhang, Didi Zhu, Tao Shen, Zexi Li, Jinluan Yang, Xuwu Wang, Jing Su, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng
Title: Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning
Abstract:
Low-Rank Adaptation (LoRA) is widely used for adapting large language models (LLMs) to specific domains due to its efficiency and modularity. Meanwhile, vanilla LoRA struggles with task conflicts in multi-task scenarios. Recent works adopt Mixture of Experts (MoE) by treating each LoRA module as an expert, thereby mitigating task interference through multiple specialized LoRA modules. While effective, these methods often isolate knowledge within individual tasks, failing to fully exploit the shared knowledge across related tasks. In this paper, we establish a connection between single LoRA and multi-LoRA MoE, integrating them into a unified framework. We demonstrate that the dynamic routing of multiple LoRAs is functionally equivalent to rank partitioning and block-level activation within a single LoRA. We further empirically demonstrate that finer-grained LoRA partitioning, within the same total and activated parameter constraints, leads to better performance gains across heterogeneous tasks. Building on these findings, we propose Single-ranked Mixture of Experts LoRA (\textbf{SMoRA}), which embeds MoE into LoRA by \textit{treating each rank as an independent expert}. With a \textit{dynamic rank-wise activation} mechanism, SMoRA promotes finer-grained knowledge sharing while mitigating task conflicts. Experiments demonstrate that SMoRA activates fewer parameters yet achieves better performance in multi-task scenarios.
中文: SMoRA将专家混合机制融入LoRA,将每个秩视为独立专家,通过动态秩激活机制实现更细粒度的知识共享,在减少任务冲突的同时以更少参数获得更优性能。
English: SMoRA integrates Mixture of Experts into LoRA by treating each rank as an independent expert, enabling finer-grained knowledge sharing and dynamic rank-wise activation to reduce task conflicts while improving performance with fewer parameters.
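
The idea of treating each LoRA rank as an expert can be illustrated with a small sketch. A LoRA update B(Ax) is a sum of r rank-one terms, so gating individual ranks amounts to activating a subset of those terms. The router below (a linear scorer followed by a softmax over the top-k ranks) is an assumption made for illustration; the abstract does not specify the actual gating function, and the names `lora_delta`, `topk_rank_gates`, and `W_router` are hypothetical.

```python
import math


def lora_delta(x, A, B, gates=None):
    """Low-rank update written as a sum over ranks: sum_i gates[i] * B[:, i] * (A[i] . x).

    A has r rows of length d_in; B has d_out rows of length r.
    """
    r = len(A)
    if gates is None:
        gates = [1.0] * r  # plain LoRA: every rank is active
    # z[i] is the scalar activation of rank i (one "expert" per rank).
    z = [sum(a * xj for a, xj in zip(A[i], x)) for i in range(r)]
    return [sum(B[o][i] * gates[i] * z[i] for i in range(r)) for o in range(len(B))]


def topk_rank_gates(x, W_router, k):
    """Hypothetical router: softmax over the k highest-scoring ranks, zero elsewhere."""
    scores = [sum(w * xj for w, xj in zip(row, x)) for row in W_router]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exp = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exp.values())
    return [exp[i] / total if i in exp else 0.0 for i in range(len(scores))]


x = [1.0, 2.0]
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # r = 3, d_in = 2
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # d_out = 2, r = 3
W_router = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

dense = lora_delta(x, A, B)                 # all three ranks active
gates = topk_rank_gates(x, W_router, k=2)   # only two ranks active
sparse = lora_delta(x, A, B, gates)
```

With all gates set to one, the function reduces to an ordinary LoRA update, which is the sense in which rank-wise gating sits inside a single LoRA module.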

Authors:Zhangfeng Ma, Ruichen Zhang, Bo Ai, Zhuxian Lian, Linzhou Zeng, Dusit Niyato
Title: Deep Reinforcement Learning for Energy Efficiency Maximization in RSMA-IRS-Assisted ISAC System
Abstract:
This paper proposes a three-dimensional (3D) geometry-based channel model to accurately represent intelligent reflecting surfaces (IRS)-enhanced integrated sensing and communication (ISAC) networks using rate-splitting multiple access (RSMA) in practical urban environments. Based on this model, we formulate an energy efficiency (EE) maximization problem that incorporates transceiver beamforming constraints, IRS phase adjustments, and quality-of-service (QoS) requirements to optimize communication and sensing functions. To solve this problem, we use the proximal policy optimization (PPO) algorithm within a deep reinforcement learning (DRL) framework. Our numerical results confirm the effectiveness of the proposed method in improving EE and satisfying QoS requirements. Additionally, we observe that system EE drops at higher frequencies, especially under double-Rayleigh fading.
中文: 本文提出了一种基于三维几何的智能反射面增强集成感知与通信网络信道模型,采用近端策略优化算法实现能效最大化并满足服务质量要求,数值结果表明该方法能有效提升能效,但在高频段特别是双瑞利衰落环境下系统性能会下降。
English: This paper introduces a 3D geometry-based channel model for IRS-enhanced ISAC networks with RSMA and employs the PPO algorithm to maximize energy efficiency while meeting QoS requirements, with results showing improved EE but reduced performance at higher frequencies under double-Rayleigh fading.

Authors:Yushen Lin, Ruichen Zhang, Wenqi Huang, Kaidi Wang, Zhiguo Ding, Daniel K. C. So, Dusit Niyato
Title: Empowering Large Language Models in Wireless Communication: A Novel Dataset and Fine-Tuning Framework
Abstract:
In this work, we develop a specialized dataset aimed at enhancing the evaluation and fine-tuning of large language models (LLMs) specifically for wireless communication applications. The dataset includes a diverse set of multi-hop questions, including true/false and multiple-choice types, spanning varying difficulty levels from easy to hard. By utilizing advanced language models for entity extraction and question generation, rigorous data curation processes are employed to maintain high quality and relevance. Additionally, we introduce a Pointwise V-Information (PVI) based fine-tuning method, providing a detailed theoretical analysis and justification for its use in quantifying the information content of training data with 2.24\% and 1.31\% performance boost for different models compared to baselines, respectively. To demonstrate the effectiveness of the fine-tuned models with the proposed methodologies on practical tasks, we also consider different tasks, including summarizing optimization problems from technical papers and solving the mathematical problems related to non-orthogonal multiple access (NOMA), which are generated by using the proposed multi-agent framework. Simulation results show significant performance gain in summarization tasks with 20.9\% in the ROUGE-L metrics. We also study the scaling laws of fine-tuning LLMs and the challenges LLMs face in the field of wireless communications, offering insights into their adaptation to wireless communication tasks. This dataset and fine-tuning methodology aim to enhance the training and evaluation of LLMs, contributing to advancements in LLMs for wireless communication research and applications.
中文摘要:本研究开发了专用数据集并提出基于PVI的微调方法,显著提升了大型语言模型在无线通信任务中的性能,在摘要生成和数学问题求解方面取得突破性进展。
English Summary: This study creates a specialized dataset and introduces a PVI-based fine-tuning method to enhance LLMs for wireless communication tasks, achieving notable performance improvements in summarization and mathematical problem-solving.

Authors:Zheqi Lv, Keming Ye, Zishu Wei, Qi Tian, Shengyu Zhang, Wenqiao Zhang, Wenjie Wang, Kun Kuang, Tat-Seng Chua, Fei Wu
Title: Optimize Incompatible Parameters through Compatibility-aware Knowledge Integration
Abstract:
Deep neural networks have become foundational to advancements in multiple domains, including recommendation systems, natural language processing, and so on. Despite their successes, these models often contain incompatible parameters that can be underutilized or detrimental to model performance, particularly when faced with specific, varying data distributions. Existing research excels in removing such parameters or merging the outputs of multiple different pretrained models. However, the former focuses on efficiency rather than performance, while the latter requires several times more computing and storage resources to support inference. In this paper, we set the goal to explicitly improve these incompatible parameters by leveraging the complementary strengths of different models, thereby directly enhancing the models without any additional parameters. Specifically, we propose Compatibility-aware Knowledge Integration (CKI), which consists of Parameter Compatibility Assessment and Parameter Splicing, which are used to evaluate the knowledge content of multiple models and integrate the knowledge into one model, respectively. The integrated model can be used directly for inference or for further fine-tuning. We conduct extensive experiments on various datasets for recommendation and language tasks, and the results show that Compatibility-aware Knowledge Integration can effectively optimize incompatible parameters under multiple tasks and settings to break through the training limit of the original model without increasing the inference cost.
中文: 本文提出兼容性知识集成方法,通过评估多个模型的参数兼容性并整合其互补知识,有效优化神经网络中的不兼容参数,在不增加推理成本的情况下提升模型性能。
English: This paper introduces Compatibility-aware Knowledge Integration (CKI), a method that assesses and optimizes incompatible parameters in deep neural networks by integrating complementary knowledge from multiple models, enhancing performance without additional inference costs.

Authors:Muqing Cao, Thien-Minh Nguyen, Shenghai Yuan, Andreas Anastasiou, Angelos Zacharia, Savvas Papaioannou, Panayiotis Kolios, Christos G. Panayiotou, Marios M. Polycarpou, Xinhang Xu, Mingjie Zhang, Fei Gao, Boyu Zhou, Ben M. Chen, Lihua Xie
Title: Cooperative Aerial Robot Inspection Challenge: A Benchmark for Heterogeneous Multi-UAV Planning and Lessons Learned
Abstract:
We propose the Cooperative Aerial Robot Inspection Challenge (CARIC), a simulation-based benchmark for motion planning algorithms in heterogeneous multi-UAV systems. CARIC features UAV teams with complementary sensors, realistic constraints, and evaluation metrics prioritizing inspection quality and efficiency. It offers a ready-to-use perception-control software stack and diverse scenarios to support the development and evaluation of task allocation and motion planning algorithms. Competitions using CARIC were held at IEEE CDC 2023 and the IROS 2024 Workshop on Multi-Robot Perception and Navigation, attracting innovative solutions from research teams worldwide. This paper examines the top three teams from CDC 2023, analyzing their exploration, inspection, and task allocation strategies while drawing insights into their performance across scenarios. The results highlight the task's complexity and suggest promising directions for future research in cooperative multi-UAV systems.
中文:协同空中机器人巡检挑战赛(CARIC)是一个面向异构多无人机运动规划的仿真基准,具备互补传感器配置和真实约束条件,通过分析顶尖团队的巡检策略揭示了任务复杂性,为协同无人机系统研究指明了方向。
English: The Cooperative Aerial Robot Inspection Challenge (CARIC) is a simulation benchmark for heterogeneous multi-UAV motion planning, featuring realistic constraints and evaluation metrics to advance cooperative inspection algorithms, with top strategies from recent competitions analyzed to guide future research.

Authors:Yinghao Hu, Leilei Gan, Wenyi Xiao, Kun Kuang, Fei Wu
Title: Fine-tuning Large Language Models for Improving Factuality in Legal Question Answering
Abstract:
Hallucination, or the generation of incorrect or fabricated information, remains a critical challenge in large language models (LLMs), particularly in high-stake domains such as legal question answering (QA). In order to mitigate the hallucination rate in legal QA, we first introduce a benchmark called LegalHalBench and three automatic metrics to evaluate the common hallucinations when LLMs answer legal questions. We then propose a hallucination mitigation method that integrates behavior cloning and a novel Hard Sample-aware Iterative Direct Preference Optimization (HIPO). We conduct extensive real-data experiments to validate the effectiveness of our approach. Our results demonstrate remarkable improvements in various metrics, including the newly proposed Non-Hallucinated Statute Rate, Statute Relevance Rate, Legal Claim Truthfulness, as well as traditional metrics such as METEOR, BERTScore, ROUGE-L, and win rates.
中文: 针对法律问答中大型语言模型的幻觉问题,本研究提出了LegalHalBench基准及三项自动评估指标,并采用行为克隆与新型硬样本感知迭代直接偏好优化相结合的方法,实验证明该方法在新型与传统评估指标上均取得显著提升。
English: To address the challenge of hallucination in large language models for legal question answering, this study introduces a benchmark called LegalHalBench with three automatic metrics and proposes a mitigation method combining behavior cloning and Hard Sample-aware Iterative Direct Preference Optimization (HIPO), which significantly improves both novel and traditional evaluation metrics in experiments.

Authors:Zheqi Lv, Tianyu Zhan, Wenjie Wang, Xinyu Lin, Shengyu Zhang, Wenqiao Zhang, Jiwei Li, Kun Kuang, Fei Wu
Title: Collaboration of Large Language Models and Small Recommendation Models for Device-Cloud Recommendation
Abstract:
Large Language Models (LLMs) for Recommendation (LLM4Rec) is a promising research direction that has demonstrated exceptional performance in this field. However, its inability to capture real-time user preferences greatly limits the practical application of LLM4Rec because (i) LLMs are costly to train and infer frequently, and (ii) LLMs struggle to access real-time data (their large number of parameters poses an obstacle to deployment on devices). Fortunately, small recommendation models (SRMs) can effectively compensate for these shortcomings of LLM4Rec by consuming minimal resources for frequent training and inference, and by conveniently accessing real-time data on devices. In light of this, we designed the Device-Cloud LLM-SRM Collaborative Recommendation Framework (LSC4Rec) under a device-cloud collaboration setting. LSC4Rec aims to integrate the advantages of both LLMs and SRMs, as well as the benefits of cloud and edge computing, achieving a complementary synergy. We enhance the practicability of LSC4Rec by designing three strategies: collaborative training, collaborative inference, and intelligent request. During training, the LLM generates candidate lists to enhance the ranking ability of the SRM in collaborative scenarios and enables the SRM to update adaptively to capture real-time user interests. During inference, the LLM and the SRM are deployed on the cloud and on the device, respectively. The LLM generates candidate lists and initial ranking results based on user behavior, and the SRM produces reranking results based on the candidate list, with final results integrating both the LLM's and the SRM's scores. The device determines whether a new candidate list is needed by comparing the consistency of the LLM's and SRM's sorted lists. Our comprehensive and extensive experimental analysis validates the effectiveness of each strategy in LSC4Rec.
中文摘要:LSC4Rec框架通过设备-云协同整合大语言模型与小推荐模型,利用协同训练、协同推理和智能请求三大策略,既弥补了大模型实时性不足的缺陷,又充分发挥了云端与终端计算的协同优势。
English Summary: The LSC4Rec framework combines large language models (LLMs) and small recommendation models (SRMs) through device-cloud collaboration to overcome LLMs' limitations in real-time preference capture while leveraging their complementary strengths via coordinated training and inference strategies.
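
The inference-time logic described in the abstract (fusing LLM and SRM scores, then requesting a new candidate list when the two rankings disagree) can be sketched as follows. The abstract does not state how scores are combined or how consistency is measured, so the linear blend, the top-k overlap measure, and the `weight`, `k`, and `threshold` parameters are all assumptions chosen for illustration.

```python
def fuse_scores(llm_scores, srm_scores, weight=0.5):
    """Blend per-item scores; `weight` is a hypothetical mixing coefficient."""
    return {
        item: weight * llm_scores[item] + (1.0 - weight) * srm_scores[item]
        for item in llm_scores
    }


def topk_overlap(ranking_a, ranking_b, k):
    """Fraction of items shared by the top-k of two ranked lists (assumed consistency measure)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def needs_new_candidates(llm_ranking, srm_ranking, k=3, threshold=0.5):
    """On-device check: request a fresh candidate list when the rankings diverge."""
    return topk_overlap(llm_ranking, srm_ranking, k) < threshold


llm_ranking = ["a", "b", "c", "d"]
agreeing_srm = ["a", "c", "b", "d"]   # same top-3 items, different order
drifted_srm = ["d", "e", "f", "a"]    # user interest has moved on

fused = fuse_scores({"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9})
```

Under these assumptions, a cloud request is made only when the on-device ranking has drifted from the cloud ranking, which is one way to limit how often the LLM is invoked.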

Authors:Jianfeng Xu, Congcong Liu, Xiaoying Tan, Xiaojie Zhu, Anpeng Wu, Huan Wan, Weijun Kong, Chun Li, Hu Xu, Kun Kuang, Fei Wu
Title: General Information Metrics for Improving AI Model Training Efficiency
Abstract:
To address the growing size of AI model training data and the lack of a universal data selection methodology -- factors that significantly drive up training costs -- this paper presents the General Information Metrics Evaluation (GIME) method. GIME leverages general information metrics from Objective Information Theory (OIT), including volume, delay, scope, granularity, variety, duration, sampling rate, aggregation, coverage, distortion, and mismatch to optimize dataset selection for training purposes. Comprehensive experiments conducted across diverse domains, such as CTR Prediction, Civil Case Prediction, and Weather Forecasting, demonstrate that GIME effectively preserves model performance while substantially reducing both training time and costs. Additionally, applying GIME within the Judicial AI Program led to a remarkable 39.56% reduction in total model training expenses, underscoring its potential to support efficient and sustainable AI development.
中文: 本文提出的通用信息度量评估(GIME)方法,通过客观信息指标优化训练数据集选择,在多个应用领域中显著降低训练成本和时间,同时保持模型性能。
English: This paper introduces the General Information Metrics Evaluation (GIME) method, which utilizes objective information metrics to optimize training dataset selection, significantly cutting costs and time while maintaining model performance across various applications.

Authors:Tong Xiao, Jingbo Zhu
Title: Foundations of Large Language Models
Abstract:
This is a book about large language models. As indicated by the title, it primarily focuses on foundational concepts rather than comprehensive coverage of all cutting-edge technologies. The book is structured into five main chapters, each exploring a key area: pre-training, generative models, prompting, alignment, and inference. It is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.
中文: 本书介绍了大型语言模型的基础知识,涵盖预训练、生成模型、提示、对齐和推理等主题,适合自然语言处理领域的学生和专业人士参考。
English: This book introduces the fundamental concepts of large language models, covering pre-training, generative models, prompting, alignment, and inference, and is designed for students and professionals in natural language processing.

Authors:Weiqiao Shan, Yuhao Zhang, Yuchen Han, Bei Li, Xiaofeng Zhao, Yuang Li, Min Zhang, Hao Yang, Tong Xiao, Jingbo Zhu
Title: Optimizing Speech Multi-View Feature Fusion through Conditional Computation
Abstract:
Recent advancements have highlighted the efficacy of self-supervised learning (SSL) features in various speech-related tasks, providing lightweight and versatile multi-view speech representations. However, our study reveals that while SSL features expedite model convergence, they conflict with traditional spectral features like FBanks in terms of update directions. In response, we propose a novel generalized feature fusion framework grounded in conditional computation, featuring a gradient-sensitive gating network and a multi-stage dropout strategy. This framework mitigates feature conflicts and bolsters model robustness to multi-view input features. By integrating SSL and spectral features, our approach accelerates convergence and maintains performance on par with spectral models across multiple speech translation tasks on the MUSTC dataset.
中文摘要:本研究提出了一种基于条件计算的通用特征融合框架,通过梯度敏感门控网络和多阶段丢弃策略,有效缓解了自监督学习特征与频谱特征间的冲突,在语音翻译任务中既加速收敛又保持性能。
English Summary: The study introduces a generalized feature fusion framework that resolves conflicts between self-supervised learning and spectral features, accelerating convergence while maintaining performance in speech translation tasks.

Authors:En Xu, Can Rong, Jingtao Ding, Yong Li
Title: A Diffusive Data Augmentation Framework for Reconstruction of Complex Network Evolutionary History
Abstract:
The evolutionary processes of complex systems contain critical information regarding their functional characteristics. The generation time of edges provides insights into the historical evolution of various networked complex systems, such as protein-protein interaction networks, ecosystems, and social networks. Recovering these evolutionary processes holds significant scientific value, including aiding in the interpretation of the evolution of protein-protein interaction networks. However, existing methods are capable of predicting the generation times of remaining edges given a partial temporal network but often perform poorly in cross-network prediction tasks. These methods frequently fail in edge generation time recovery tasks for static networks that lack timestamps. In this work, we adopt a comparative paradigm-based framework that fuses multiple networks for training, enabling cross-network learning of the relationship between network structure and edge generation times. Compared to separate training, this approach yields an average accuracy improvement of 16.98%. Furthermore, given the difficulty in collecting temporal networks, we propose a novel diffusion-model-based generation method to produce a large number of temporal networks. By combining real temporal networks with generated ones for training, we achieve an additional average accuracy improvement of 5.46% through joint training.
Chinese: 本研究提出一种基于比较范式的框架,通过融合多个网络进行跨网络学习,将边生成时间预测准确率平均提升16.98%,并利用基于扩散模型的时间网络生成方法进行联合训练,进一步实现5.46%的平均准确率提升。
English: This study introduces a comparative paradigm-based framework that integrates multiple networks for cross-network learning, improving edge generation time prediction accuracy by 16.98%, and further enhances performance by 5.46% through a diffusion-model-based method that generates temporal networks for joint training.

Authors:Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng
Title: Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
Abstract:
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, even with advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
中文: EMMA作为新型基准测试,旨在通过要求跨复杂领域对文本和图像进行整合分析来严格评估多模态推理能力,揭示了当前多模态大语言模型即使采用先进技术仍存在显著缺陷。
English: EMMA is a new benchmark designed to rigorously evaluate multimodal reasoning by requiring integrated analysis of text and images across complex domains, revealing current MLLMs' significant limitations despite advanced techniques.

Authors:Yuchun Fan, Yongyu Mu, Yilin Wang, Lei Huang, Junhao Ruan, Bei Li, Tong Xiao, Shujian Huang, Xiaocheng Feng, Jingbo Zhu
Title: SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment
Abstract:
Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter and two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational resource consumption and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our findings that the representation learning of languages is merely conducted in lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, only tunes 6 layers' feed-forward sub-layers including 6.5-8% of all parameters within 7B and 13B LLMs, achieving better average performance than all strong baselines across 10 languages. Meanwhile, SLAM only involves one training stage, reducing training time by 4.1-11.9 compared to the two-stage method.
中文摘要:SLAM方法通过仅微调负责语言表征的6个关键层,在10种语言上实现了优于所有基线模型的多语言推理性能,同时大幅降低了计算资源和训练时间消耗。
English Summary: The SLAM method efficiently enhances multilingual reasoning in large language models by selectively fine-tuning only 6 layers responsible for language representation, achieving superior performance across 10 languages with significantly reduced computational costs and training time.

Authors:Md Raz, P. V. Sai Charan, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri
Title: SHIELD: Secure Host-Independent Extensible Logging for Tamper-Proof Detection and Real-Time Mitigation of Ransomware Threats
Abstract:
Ransomware's escalating sophistication necessitates tamper-resistant, off-host detection solutions that capture deep disk activity beyond the reach of a compromised operating system while overcoming evasion and obfuscation techniques. To address this, we introduce SHIELD: a metric acquisition framework leveraging low-level filesystem monitoring and Network Block Device (NBD) technology to provide off-host, tamper-proof measurements for continuous observation of disk activity exhibited by software executing on a target device. We employ Shield within a detection architecture leveraging deep filesystem features along with simplified metrics aggregated based on frequency of disk actions, making the metrics impervious to obfuscation while avoiding reliance on vulnerable host-based logs. We evaluate the efficacy of these metrics through extensive experiments with both binary (benign vs. malicious behavior) and multiclass (ransomware strain identification) classifiers and confirm that our metrics yield high accuracy across diverse threat profiles, including intermittent or partial encryption. In a proof-of-concept deployment, we demonstrate real-time mitigation using models trained on these metrics by halting malicious disk operations after ransomware detection with minimum file loss and memory corruption. We also show that hardware-only features collected independently of OS or network stack retain high detection effectiveness, verifying feasibility of embedding the proposed pipeline in a SATA controller ASIC or FPGA for next-generation, disk-centric defenses that combine filesystem insight with inherent off-host isolation.
English Summary: SHIELD is a tamper-resistant framework that uses low-level filesystem monitoring and NBD technology for off-host ransomware detection, achieving high accuracy against evasion techniques through disk activity analysis and enabling real-time mitigation with minimal data loss.

Authors:Farshad Khorrami, Ramesh Karri, Prashanth Krishnamurthy
Title: Real-Time Multi-Modal Subcomponent-Level Measurements for Trustworthy System Monitoring and Malware Detection
Abstract:
With increasingly sophisticated cyber-adversaries able to access a wider repertoire of mechanisms to implant malware such as ransomware, CPU/GPU keyloggers, and stealthy kernel rootkits, there is an urgent need for techniques to detect and mitigate such attacks. While state of the art relies on digital and analog side channel measurements assuming trustworthiness of measurements obtained on the main processor, such an approach has limitations since processor-based side channel measurements are potentially untrustworthy. Sophisticated adversaries (especially in late stage cyber attacks when they have breached the computer and network security systems such as firewalls and antivirus and penetrated the computer's OS) can compromise user-space and kernel-space measurements. To address this key limitation of state of the art, we propose a "subcomponent-level" approach to collect side channel measurements so as to enable robust anomaly detection in a modern computer even when the main processor is compromised. Our proposed approach leverages the fact that modern computers are complex systems with multiple interacting subcomponents and measurements from subcomponents can be used to detect anomalies even when the main processor is no longer trustworthy. We develop mechanisms to obtain time series measurements of activity of several subcomponents and methodologies to process and fuse these measurements for anomaly detection. The subcomponents include network interface controller, GPU, CPU Hardware Performance Counters, CPU power, and keyboard. Our main hypothesis is that subcomponent measurements can enable detection of security threats without requiring a trustworthy main processor. By enabling real-time measurements from multiple subcomponents, the goal is to provide a deeper visibility into system operation, thereby yielding a powerful tool to track system operation and detect anomalies.
Chinese: 针对能够破坏主处理器安全性的高级网络威胁,本研究提出了一种子组件级方法,利用网络控制器和GPU等组件的侧信道测量,即使在主处理器不可信时也能实现稳健的异常检测。
English: To counter advanced cyber threats that can compromise main processor-based security, this study introduces a subcomponent-level approach using side channel measurements from components like network controllers and GPUs for robust anomaly detection even when the main processor is untrustworthy.

Authors:Keer Lu, Zheng Liang, Zhuoran Zhang, Da Pan, Shusen Zhang, Xin Wu, Zenan Zhou, Guosheng Dong, Bin Cui, Tengjiao Wang, Wentao Zhang
Title: Med-R$^2$: Crafting Trustworthy LLM Physicians via Retrieval and Reasoning of Evidence-Based Medicine
Abstract:
Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R$^2$, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R$^2$ achieves a 14.74\% improvement over vanilla RAG methods and even a 3.32\% enhancement compared to fine-tuning strategies, without incurring additional training costs.
中文:Med-R²框架通过整合循证检索与推理机制,有效提升大语言模型在医疗领域的能力,无需额外训练成本即显著超越现有方法的性能表现。
English: The Med-R² framework enhances LLMs' medical proficiency by integrating evidence-based retrieval and reasoning, achieving significant performance improvements over existing methods without additional training costs.

Authors:Keer Lu, Zheng Liang, Da Pan, Shusen Zhang, Guosheng Dong, Zhonghai Wu, Huang Leng, Bin Cui, Wentao Zhang
Title: Med-R$^2$: Crafting Trustworthy LLM Physicians via Retrieval and Reasoning of Evidence-Based Medicine
Abstract:
Large Language Models (LLMs) have exhibited remarkable capabilities in clinical scenarios. Despite their potential, existing works face challenges when applying LLMs to medical settings. Strategies relying on training with medical datasets are highly cost-intensive and may suffer from outdated training data. Leveraging external knowledge bases is a suitable alternative, yet it faces obstacles such as limited retrieval precision and poor effectiveness in answer extraction. These issues collectively prevent LLMs from demonstrating the expected level of proficiency in mastering medical expertise. To address these challenges, we introduce Med-R$^2$, a novel LLM physician framework that adheres to the Evidence-Based Medicine (EBM) process, efficiently integrating retrieval mechanisms as well as the selection and reasoning processes of evidence, thereby enhancing the problem-solving capabilities of LLMs in healthcare scenarios and fostering a trustworthy LLM physician. Our comprehensive experiments indicate that Med-R$^2$ achieves a 13.27\% improvement over vanilla RAG methods and even a 4.55\% enhancement compared to fine-tuning strategies, without incurring additional training costs. Furthermore, we find that our LLaMA3.1-70B + Med-R$^2$ surpasses frontier models, including GPT-4o, Claude3.5-Sonnet and DeepSeek-V3, by 1.05\%, 6.14\% and 1.91\%, respectively. Med-R$^2$ effectively enhances the capabilities of LLMs in the medical domain.
中文:Med-R²框架通过整合循证检索与推理机制,有效提升大语言模型在医疗领域的能力,无需额外训练成本即显著超越现有方法的性能表现。
English: The Med-R² framework enhances LLMs' medical proficiency by integrating evidence-based retrieval and reasoning, achieving significant performance improvements over existing methods without additional training costs.

Authors:Botao Zhao, Xiaoyang Qu, Zuheng Kang, Junqing Peng, Jing Xiao, Jianzong Wang
Title: ACCon: Angle-Compensated Contrastive Regularizer for Deep Regression
Abstract:
In deep regression, capturing the relationship among continuous labels in feature space is a fundamental challenge that has attracted increasing interest. Addressing this issue can prevent models from converging to suboptimal solutions across various regression tasks, leading to improved performance, especially for imbalanced regression and under limited sample sizes. However, existing approaches often rely on order-aware representation learning or distance-based weighting. In this paper, we hypothesize a linear negative correlation between label distances and representation similarities in regression tasks. To implement this, we propose an angle-compensated contrastive regularizer for deep regression, which adjusts the cosine distance between anchor and negative samples within the contrastive learning framework. Our method offers a plug-and-play compatible solution that extends most existing contrastive learning methods for regression tasks. Extensive experiments and theoretical analysis demonstrate that our proposed angle-compensated contrastive regularizer not only achieves competitive regression performance but also excels in data efficiency and effectiveness on imbalanced datasets.
中文摘要:本文提出一种角度补偿对比正则器,通过调整样本间的余弦距离来增强深度回归中的表示学习,实验证明该方法在数据效率和类别不平衡回归任务中表现优异。
English Summary: This paper introduces an angle-compensated contrastive regularizer for deep regression that adjusts cosine distances between samples based on label proximity, demonstrating improved performance and data efficiency across various regression tasks.

Authors:Lin Yuan, Kai Liang, Xiong Li, Tao Wu, Nannan Wang, Xinbo Gao
Title: iFADIT: Invertible Face Anonymization via Disentangled Identity Transform
Abstract:
Face anonymization aims to conceal the visual identity of a face to safeguard the individual's privacy. Traditional methods like blurring and pixelation can largely remove identifying features, but these techniques significantly degrade image quality and are vulnerable to deep reconstruction attacks. Generative models have emerged as a promising solution for anonymizing faces while preserving a natural appearance. However, many still face limitations in visual quality and often overlook the potential to recover the original face from the anonymized version, which can be valuable in specific contexts such as image forensics. This paper proposes a novel framework named iFADIT, an acronym for Invertible Face Anonymization via Disentangled Identity Transform. The framework features a disentanglement architecture coupled with a secure flow-based model: the former decouples identity information from non-identifying attributes, while the latter transforms the decoupled identity into an anonymized version in an invertible manner controlled by a secret key. The anonymized face can then be reconstructed based on a pre-trained StyleGAN that ensures high image quality and realistic facial details. Recovery of the original face (aka de-anonymization) is possible upon the availability of the matching secret, by inverting the anonymization process based on the same set of model parameters. Furthermore, a dedicated secret-key mechanism along with a dual-phase training strategy is devised to ensure the desired properties of face anonymization. Qualitative and quantitative experiments demonstrate the superiority of the proposed approach in anonymity, reversibility, security, diversity, and interpretability over competing methods.
中文: 本文提出iFADIT框架,通过解耦身份信息与流模型实现可逆人脸匿名化,在保证图像质量的同时支持密钥控制的原始人脸恢复,在匿名性、可逆性和安全性方面优于现有方法。
English: This paper introduces iFADIT, an invertible face anonymization framework that uses a disentanglement architecture and flow-based model to securely transform facial identity while preserving image quality, allowing reversible de-anonymization with a secret key.

Authors:Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo
Title: LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation
Abstract:
In commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified Linear Attention using few heads, observing the free-lunch effect of performance without latency increase. (2) Weight inheritance from a fully pre-trained diffusion Transformer: initializing linear Transformer using pre-trained diffusion Transformer and loading all parameters except for those related to linear attention. (3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to help the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that in class-conditional 256*256 and 512*512 ImageNet benchmark LiT achieves highly competitive FID while reducing training steps by 80% and 77% compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. Besides, for text-to-image generation, LiT allows for the rapid synthesis of up to 1K resolution photorealistic images. Project page: https://techmonsterwang.github.io/LiT/.
中文: 本文提出了线性扩散变换器(LiT),通过线性注意力和五项优化策略将预训练DiT简化为高效模型,仅需少量训练步骤即可实现相当的图像生成性能。
English: This paper introduces Linear Diffusion Transformer (LiT), a streamlined version of DiT that uses linear attention and five optimization strategies to achieve efficient image generation with comparable performance using significantly fewer training steps.
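To make the linear attention mentioned in the abstract above concrete, here is a minimal single-head, non-causal sketch in plain Python. It assumes a ReLU feature map and a small `eps` for numerical safety; both are illustrative choices, not details confirmed by the abstract, and the function name `linear_attention` is hypothetical. The point is that the key-value summary is computed once, so the cost grows linearly with the number of tokens rather than quadratically.

```python
def relu(x):
    """Feature map applied to queries and keys (an assumption for this sketch)."""
    return x if x > 0 else 0.0


def linear_attention(Q, K, V, eps=1e-6):
    """Single-head linear attention over lists of row vectors.

    Q, K: n rows of dimension d. V: n rows of dimension dv.
    Computes phi(Q) @ (phi(K)^T @ V), normalised by phi(Q) @ sum_j phi(K_j).
    """
    n, d, dv = len(K), len(K[0]), len(V[0])
    phi_q = [[relu(x) for x in row] for row in Q]
    phi_k = [[relu(x) for x in row] for row in K]

    # Key-value summary (d x dv), shared by every query.
    kv = [[sum(phi_k[j][a] * V[j][b] for j in range(n)) for b in range(dv)]
          for a in range(d)]
    # Sum of key features (length d), used for normalisation.
    k_sum = [sum(phi_k[j][a] for j in range(n)) for a in range(d)]

    out = []
    for q in phi_q:
        denom = sum(q[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(q[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out
```

With identical keys, the output for a matching query is simply the average of the values, which is a quick sanity check for the normalisation.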

Authors:Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong Liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo
Title: LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation
Abstract:
In this paper, we investigate how to convert a pre-trained Diffusion Transformer (DiT) into a linear DiT, motivated by the simplicity, parallelism, and efficiency of linear attention for image generation. Through detailed exploration, we offer a suite of ready-to-use solutions, ranging from linear attention design to optimization strategies. Our core contributions include 5 practical guidelines: 1) Applying depth-wise convolution within simple linear attention is sufficient for image generation. 2) Using fewer heads in linear attention provides a free-lunch performance boost without increasing latency. 3) Inheriting weights from a fully converged, pre-trained DiT. 4) Loading all parameters except those related to linear attention. 5) Hybrid knowledge distillation: using a pre-trained teacher DiT to help the training of the student linear DiT, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed \underline{L}inear D\underline{i}ffusion \underline{T}ransformer (LiT), which serves as a safe and efficient alternative baseline for DiT with pure linear attention. In class-conditional 256$\times$256 and 512$\times$512 ImageNet generation, LiT can be quickly adapted from DiT using only $20\%$ and $33\%$ of DiT's training steps, respectively, while achieving comparable performance. LiT also rivals methods based on Mamba or Gated Linear Attention. Moreover, the same guidelines generalize to text-to-image generation: LiT can be swiftly converted from PixArt-$Σ$ to generate high-quality images, maintaining comparable GenEval scores.
中文: 本文提出了线性扩散变换器(LiT),通过线性注意力和五项优化策略将预训练DiT简化为高效模型,仅需少量训练步骤即可实现相当的图像生成性能。
English: This paper introduces Linear Diffusion Transformer (LiT), a streamlined version of DiT that uses linear attention and five optimization strategies to achieve efficient image generation with comparable performance using significantly fewer training steps.

Authors:Feiteng Mu, Liwen Zhang, Yong Jiang, Wenjie Li, Zhen Zhang, Pengjun Xie, Fei Huang
Title: Unsupervised Query Routing for Retrieval Augmented Generation
Abstract:
Query routing for retrieval-augmented generation aims to assign an input query to the most suitable search engine. Existing works rely heavily on supervised datasets that require extensive manual annotation, resulting in high costs and limited scalability, as well as poor generalization to out-of-distribution scenarios. To address these challenges, we introduce a novel unsupervised method that constructs the "upper-bound" response to evaluate the quality of retrieval-augmented responses. This evaluation enables the decision of the most suitable search engine for a given query. By eliminating manual annotations, our approach can automatically process large-scale real user queries and create training data. We conduct extensive experiments across five datasets, demonstrating that our method significantly enhances scalability and generalization capabilities.
中文: 本文提出一种无监督方法,通过构建“上限”响应来评估检索增强生成的质量,无需人工标注即可自动选择搜索引擎并生成大规模训练数据。
English: This paper introduces an unsupervised method that constructs an upper-bound response to evaluate retrieval-augmented generation quality, enabling automatic search engine selection and scalable training data creation without manual annotations.
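The routing idea in the abstract above (score each search engine's retrieval-augmented response against an "upper-bound" response, then pick the best engine) can be sketched as follows. This is only an illustration under stated assumptions: the token-overlap F1 used as the quality score, the function names `token_f1` and `route`, and the engine names in the usage note are all hypothetical, and the abstract does not say which similarity measure the authors use.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-overlap F1 between two strings (an assumed stand-in quality score)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def route(responses, upper_bound):
    """Return the engine whose response scores highest against the upper bound.

    responses: dict mapping an engine name to the response generated with it.
    upper_bound: the reference response built without manual annotation.
    """
    return max(responses, key=lambda name: token_f1(responses[name], upper_bound))
```

For example, `route({"web": "paris is the capital of france", "news": "stocks fell today"}, "the capital of france is paris")` selects `"web"`; the resulting (query, engine) pairs could then serve as automatically generated training labels.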

Authors:Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang
Title: WebWalker: Benchmarking LLMs in Web Traversal
Abstract:
Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address it, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrates the effectiveness of RAG combined with WebWalker, through the horizontal and vertical integration in real-world scenarios.
中文: 检索增强生成在开放领域问答中表现卓越,但传统搜索引擎常检索浅层内容,为此引入WebWalkerQA评估大语言模型的网页遍历能力及模拟人类导航的多智能体框架WebWalker,实验证明其在真实场景中效果显著。
English: Retrieval-augmented generation (RAG) excels in open-domain question-answering, but traditional search engines often retrieve shallow content, prompting the introduction of WebWalkerQA to evaluate LLMs' web traversal abilities and WebWalker, a multi-agent framework that mimics human navigation, proving effective in real-world scenarios.

Authors:Lingdong Kong, Xiang Xu, Youquan Liu, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu
Title: LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
Abstract:
Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection. Extensive experiments on eleven large-scale multi-modal datasets highlight our superior performance, demonstrating the adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
中文:LargeAD框架利用视觉基础模型将2D图像超像素与3D激光雷达数据对齐,通过跨模态对比学习和多数据集预训练,在自动驾驶任务中实现了最先进的性能。
English: The LargeAD framework leverages vision foundation models to align 2D image superpixels with 3D LiDAR data, achieving state-of-the-art performance in autonomous driving tasks through cross-modal contrastive learning and multi-dataset pretraining.

Authors:Lingdong Kong, Xiang Xu, Youquan Liu, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu
Title: LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving
Abstract:
Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: (i) VFM-driven superpixel generation for detailed semantic representation, (ii) a VFM-assisted contrastive learning strategy to align multimodal features, (iii) superpoint temporal consistency to maintain stable representations across time, and (iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach achieves substantial gains over state-of-the-art methods in linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on 11 large-scale multi-sensor datasets highlight our superior performance, demonstrating adaptability, efficiency, and robustness in real-world autonomous driving scenarios.
中文:LargeAD框架利用视觉基础模型将2D图像超像素与3D激光雷达数据对齐,通过跨模态对比学习和多数据集预训练,在自动驾驶任务中实现了最先进的性能。
English: The LargeAD framework leverages vision foundation models to align 2D image superpixels with 3D LiDAR data, achieving state-of-the-art performance in autonomous driving tasks through cross-modal contrastive learning and multi-dataset pretraining.

Authors:Jinchao Li, Yuejiao Wang, Junan Li, Jiawen Kang, Bo Zheng, Simon Wong, Brian Mak, Helene Fung, Jean Woo, Man-Wai Mak, Timothy Kwok, Vincent Mok, Xianmin Gong, Xixin Wu, Xunying Liu, Patrick Wong, Helen Meng
Title: Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives
Abstract:
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Given that language impairments manifest early in NCD progression, visual-stimulated narrative (VSN)-based analysis offers a promising avenue for NCD detection. Current VSN-based NCD detection methods primarily focus on linguistic microstructures (e.g., pauses, lexical diversity), which are potentially linked to bottom-up (stimulus-driven) cognitive processing. While these features illuminate basic language abilities, the higher-order linguistic macrostructures (e.g., thematic or logical development), which may reflect top-down (concept-driven) cognitive abilities, remain underexplored. These patterns are crucial for NCD detection yet challenging to quantify due to their abstract and complex nature. To bridge this gap, we propose two novel dynamic macrostructural approaches: (1) Dynamic Topic Model (DTM) to track topic evolution over time, and (2) Text-Image Temporal Alignment Network (TITAN) to measure cross-modal consistency between speech and visual stimuli. Experimental results validated the efficiency of proposed approaches in NCD detection, with TITAN achieving superior performance both on the CU-MARVEL-RABBIT corpus (F1 = 0.7238) and the ADReSS corpus (F1 = 0.8889). The feature contribution analysis revealed that macrostructural features (e.g., topic variability, topic change rate, and topic consistency) constituted the most significant contributors in the model's decision pathways, outperforming investigated microstructural features. These findings underscore the critical role of macrostructural patterns in understanding cognitive impairment mechanisms in NCDs.
中文: 本研究提出两种动态宏观结构分析方法DTM和TITAN,通过分析高阶语言模式能有效检测神经认知障碍,其性能优于传统微观结构特征,凸显了宏观模式在理解认知障碍机制中的关键作用。
English: This study introduces two novel dynamic macrostructural approaches, DTM and TITAN, which effectively detect neurocognitive disorders by analyzing higher-order linguistic patterns and demonstrate superior performance over traditional microstructural methods.

Authors:Dongyang Dai, Zhiyong Wu, Runnan Li, Xixin Wu, Jia Jia, Helen Meng
Title: Learning Discriminative Features from Spectrograms Using Center Loss for Speech Emotion Recognition
Abstract:
Identifying the emotional state from speech is essential for natural interaction between a machine and a speaker. However, extracting effective features for emotion recognition is difficult, as emotions are ambiguous. We propose a novel approach to learn discriminative features from variable-length spectrograms for emotion recognition by combining softmax cross-entropy loss with center loss. The softmax cross-entropy loss makes features from different emotion categories separable, while center loss efficiently pulls the features belonging to the same emotion category toward their center. Combining the two losses strengthens the discriminative power of the features, which leads the network to learn more effective features for emotion recognition. As demonstrated by the experimental results, after introducing center loss, both the unweighted accuracy and weighted accuracy are improved by over 3\% on Mel-spectrogram input, and by more than 4\% on Short Time Fourier Transform spectrogram input.
中文摘要:本研究提出一种新方法,通过结合softmax交叉熵损失和中心损失从变长频谱图中学习区分性特征以提升语音情感识别效果,实验表明该方法在梅尔频谱图和短时傅里叶变换频谱图上分别实现了超过3%和4%的准确率提升。
English Summary: The study introduces a novel method that combines softmax cross-entropy loss and center loss to enhance emotion recognition from speech by learning discriminative features from spectrograms, resulting in accuracy improvements of over 3-4% in experiments.
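A minimal sketch of the joint objective described in the abstract above, for a single sample. It assumes the standard center-loss form (half the squared Euclidean distance between a feature vector and the center of its class) and a hypothetical weighting factor `lam`; the weight the authors actually use and their center-update rule are not given in the abstract, and the function names are illustrative.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of the softmax over logits, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def center_loss(feature, center):
    """Half the squared distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))


def joint_loss(logits, feature, label, centers, lam=0.3):
    """Softmax cross-entropy plus lam times center loss.

    centers maps each class label to its current center vector.
    lam is an assumed trade-off weight, not a value from the paper.
    """
    ce = softmax_cross_entropy(logits, label)
    return ce + lam * center_loss(feature, centers[label])
```

The cross-entropy term separates classes, while the center term shrinks within-class spread; setting `lam=0` recovers plain softmax training.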

Authors:Dongyang Dai, Zhiyong Wu, Shiyin Kang, Xixin Wu, Jia Jia, Dan Su, Dong Yu, Helen Meng
Title: Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT
Abstract:
Grapheme-to-phoneme (G2P) conversion serves as an essential component in Mandarin Chinese text-to-speech (TTS) systems, where polyphone disambiguation is the core issue. In this paper, we propose an end-to-end framework to predict the pronunciation of a polyphonic character, which accepts a sentence containing the polyphonic character as input in the form of a Chinese character sequence, without the need for any preprocessing. The proposed method consists of a pre-trained bidirectional encoder representations from Transformers (BERT) model and a neural network (NN) based classifier. The pre-trained BERT model extracts semantic features from a raw Chinese character sequence, and the NN-based classifier predicts the polyphonic character's pronunciation according to the BERT output. In our experiments, we implemented three classifiers: a fully-connected network based classifier, a long short-term memory (LSTM) network based classifier, and a Transformer block based classifier. The experimental results, compared with the baseline approach based on LSTM, demonstrate that the pre-trained model extracts effective semantic features, which greatly enhance the performance of polyphone disambiguation. In addition, we explored the impact of contextual information on polyphone disambiguation.
中文: 本文提出了一种端到端框架,结合预训练的BERT模型和神经网络分类器,通过从原始汉字序列中有效提取语义特征,显著提升了普通话文本转语音系统中多音字消歧的性能。
English: This paper introduces an end-to-end framework combining a pre-trained BERT model with neural network classifiers to enhance polyphone disambiguation in Mandarin text-to-speech systems by effectively extracting semantic features from raw character sequences.

Authors:Weiqi Wu, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Hai Zhao
Title: Unfolding the Headline: Iterative Self-Questioning for News Retrieval and Timeline Summarization
Abstract:
In the fast-changing realm of information, the capacity to construct coherent timelines from extensive event-related content has become increasingly significant and challenging. The complexity arises in aggregating related documents to build a meaningful event graph around a central topic. This paper proposes CHRONOS - Causal Headline Retrieval for Open-domain News Timeline SummarizatiOn via Iterative Self-Questioning, which offers a fresh perspective on the integration of Large Language Models (LLMs) to tackle the task of Timeline Summarization (TLS). By iteratively reflecting on how events are linked and posing new questions regarding a specific news topic to gather information online or from an offline knowledge base, LLMs produce and refresh chronological summaries based on documents retrieved in each round. Furthermore, we curate Open-TLS, a novel dataset of timelines on recent news topics authored by professional journalists to evaluate open-domain TLS where information overload makes it impossible to find comprehensive relevant documents from the web. Our experiments indicate that CHRONOS is not only adept at open-domain timeline summarization, but it also rivals the performance of existing state-of-the-art systems designed for closed-domain applications, where a related news corpus is provided for summarization.
中文: 本文提出CHRONOS方法,利用大语言模型通过迭代自问构建时间线摘要,并在开放和封闭领域均展现出与顶尖系统相媲美的性能。
English: This paper introduces CHRONOS, a novel approach using Large Language Models for timeline summarization that iteratively self-questions to build chronological event summaries and demonstrates competitive performance in both open and closed domains.

Authors:Zhongwei Ren, Yunchao Wei, Xun Guo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
Title: VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Abstract:
This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research.
中文: 本研究表明,仅通过未标记视频数据训练的深度生成模型VideoWorld能够有效获取规则、推理和规划等复杂知识,无需文本输入即可在围棋和机器人控制任务中达到专业水平。
English: This research demonstrates that VideoWorld, a deep generative model trained solely on unlabeled video data, can effectively acquire complex knowledge such as rules, reasoning, and planning abilities without text input, achieving professional-level performance in Go and robotic control tasks.

Authors:Yirong Zeng, Xiao Ding, Yuxian Wang, Weiwen Liu, Wu Ning, Yutai Hou, Xu Huang, Bing Qin, Ting Liu
Title: iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
Abstract:
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from more synthetic data, which cannot equip it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that the above limitation usually manifests as a fragment deficiency (i.e., parameter errors) in the response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search; and (2) iteratively pinpointing the model's deficiency by constructing fine-grained preference pairs, and then applying preference optimization algorithms for targeted improvement. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.
Chinese: 通过合成数据增强大语言模型的外部工具使用面临收益递减问题,但采用蒙特卡洛树搜索路径探索和偏好优化的迭代强化微调策略,使性能比基准模型提升13.11%,在复杂场景中表现尤为突出。
English: Augmenting LLMs with external tools via synthetic data faces diminishing returns, but an iterative reinforced fine-tuning strategy using Monte Carlo Tree Search and preference optimization improves performance by 13.11% over base models and excels in complex scenarios.

Authors:Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Abstract:
Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric to quantify underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts, encouraging deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
中文摘要:研究发现o1类大语言模型存在"思考不足"现象,即过早切换思路导致推理深度不足,并提出通过惩罚性解码策略在不微调模型的情况下有效提升准确率。
English Summary: The study identifies "underthinking" in o1-like LLMs where premature thought switching hinders deep reasoning, and proposes a penalty-based decoding strategy that improves accuracy without fine-tuning.
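
The abstract describes the thought switching penalty only at a high level. As a minimal sketch of one plausible form, the function below lowers the logits of tokens that signal a thought switch while the current thought is still short. The parameter names (`switch_token_ids`, `penalty`, `window`) and their default values are assumptions for illustration and are not taken from the paper.

```python
def apply_thought_switch_penalty(logits, switch_token_ids, tokens_in_thought,
                                 penalty=3.0, window=600):
    """Return a copy of `logits` with thought-switch tokens penalized.

    logits: list of floats, one per vocabulary entry.
    switch_token_ids: ids of tokens assumed to start a new thought
        (hypothetical; e.g. the id of a word such as "Alternatively").
    tokens_in_thought: number of tokens generated since the current thought began.
    penalty, window: hypothetical strength and duration of the penalty.
    """
    adjusted = list(logits)  # copy so the caller's list is not modified
    if tokens_in_thought < window:
        # The current thought is still young: make switching less likely.
        for token_id in switch_token_ids:
            adjusted[token_id] -= penalty
    return adjusted
```

Because the adjustment is applied to logits at decoding time, a sketch like this needs no change to the model weights, which is consistent with the abstract's claim that no fine-tuning is required.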

Authors:Yuechen Yang, Yu Wang, Tianyuan Yao, Ruining Deng, Mengmeng Yin, Shilin Zhao, Haichun Yang, Yuankai Huo
Title: PySpatial: A High-Speed Whole Slide Image Pathomics Toolkit
Abstract:
Whole Slide Image (WSI) analysis plays a crucial role in modern digital pathology, enabling large-scale feature extraction from tissue samples. However, traditional feature extraction pipelines based on tools like CellProfiler often involve lengthy workflows, requiring WSI segmentation into patches, feature extraction at the patch level, and subsequent mapping back to the original WSI. To address these challenges, we present PySpatial, a high-speed pathomics toolkit specifically designed for WSI-level analysis. PySpatial streamlines the conventional pipeline by directly operating on computational regions of interest, reducing redundant processing steps. Utilizing rtree-based spatial indexing and matrix-based computation, PySpatial efficiently maps and processes computational regions, significantly accelerating feature extraction while maintaining high accuracy. Our experiments on two datasets, Perivascular Epithelioid Cell (PEC) and data from the Kidney Precision Medicine Project (KPMP), demonstrate substantial performance improvements. For smaller and sparse objects in PEC datasets, PySpatial achieves nearly a 10-fold speedup compared to standard CellProfiler pipelines. For larger objects, such as glomeruli and arteries in KPMP datasets, PySpatial achieves a 2-fold speedup. These results highlight PySpatial's potential to handle large-scale WSI analysis with enhanced efficiency and accuracy, paving the way for broader applications in digital pathology.
Chinese: PySpatial 是一种高速病理组学工具包,通过直接处理计算感兴趣区域来简化全切片图像分析流程,在保持精度的同时实现了高达10倍的加速效果。
English: PySpatial is a high-speed pathomics toolkit that streamlines whole slide image analysis by directly processing computational regions of interest, achieving up to a 10-fold speedup while maintaining accuracy.

Authors:Roberto Daza, Lin Shengkai, Aythami Morales, Julian Fierrez, Katashi Nagao
Title: SMARTe-VR: Student Monitoring and Adaptive Response Technology for e-Learning in Virtual Reality
Abstract:
This work introduces SMARTe-VR, a platform for student monitoring in an immersive virtual reality environment designed for online education. SMARTe-VR aims to collect data for adaptive learning, focusing on facial biometrics and learning metadata. The platform allows instructors to create customized learning sessions with video lectures, featuring an interface with an AutoQA system to evaluate understanding, interaction tools (for example, textbook highlighting and lecture tagging), and real-time feedback. Furthermore, we released a dataset that contains 5 research challenges with data from 10 users in VR-based TOEIC sessions. This data set, which spans more than 25 hours, includes facial features, learning metadata, 450 responses, difficulty levels of the questions, concept tags, and understanding labels. Alongside the database, we present preliminary experiments using Item Response Theory models, adapted for understanding detection using facial features. Two architectures were explored: a Temporal Convolutional Network for local features and a Multilayer Perceptron for global features.
中文: 本研究介绍了SMARTe-VR虚拟现实平台,它通过面部生物特征和学习元数据监控在线教育中的学生以实现自适应学习,并发布了基于VR的托业课程数据集及采用项目反应理论模型进行理解检测的初步实验。
English: This study presents SMARTe-VR, a virtual reality platform for online education that monitors students through facial biometrics and learning metadata to enable adaptive learning, and it also releases a comprehensive dataset from VR-based TOEIC sessions along with preliminary experiments using Item Response Theory models for understanding detection.
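
For readers unfamiliar with Item Response Theory, the following is a minimal sketch of the standard two-parameter logistic (2PL) model, which gives the probability that a learner answers an item correctly. It shows only the textbook formula; how the paper adapts IRT to facial features is not specified in the abstract, and the parameter names here are illustrative.

```python
import math


def irt_2pl(ability, difficulty, discrimination=1.0):
    """Probability of a correct response under the standard 2PL IRT model.

    P(correct) = 1 / (1 + exp(-a * (theta - b)))
    where theta is learner ability, b is item difficulty,
    and a is item discrimination.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
```

When ability equals difficulty the model returns 0.5, and the probability rises as ability exceeds difficulty.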

Authors:Jin Chen, Jin Zhang, Xu Huang, Yi Yang, Defu Lian, Enhong Chen
Title: Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications
Abstract:
The softmax function is a cornerstone of multi-class classification, integral to a wide range of machine learning applications, from large-scale retrieval and ranking models to advanced large language models. However, its computational cost grows linearly with the number of classes, which becomes prohibitively expensive in scenarios with millions or even billions of classes. The sampled softmax, which relies on self-normalized importance sampling, has emerged as a powerful alternative, significantly reducing computational complexity. Yet, its estimator remains unbiased only when the sampling distribution matches the true softmax distribution. To improve both approximation accuracy and sampling efficiency, we propose the MIDX Sampler, a novel adaptive sampling strategy based on an inverted multi-index approach. Concretely, we decompose the softmax probability into several multinomial probabilities, each associated with a specific set of codewords and the last associated with the residual score of queries, thus reducing time complexity to the number of codewords instead of the number of classes. To further boost efficiency, we replace the query-specific residual probability with a simple uniform distribution, simplifying the computation while retaining high performance. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds. The results demonstrate that a smaller divergence from the ideal softmax distribution leads to faster convergence and improved generalization. Extensive experiments on large-scale language models, sequential recommenders, and extreme multi-class classification tasks confirm that the MIDX Sampler delivers superior effectiveness and efficiency compared to existing approaches.
中文摘要:MIDX采样器通过分解概率分布并采用均匀残差分布的新型自适应采样策略,显著降低了softmax计算复杂度,在保证理论严谨性的同时实现了更快的收敛速度和更优的泛化性能。
English Summary: The MIDX Sampler is an innovative adaptive sampling method that reduces softmax computational complexity by decomposing probabilities and using a uniform residual distribution, achieving faster convergence and better generalization with strong theoretical guarantees.
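
To make the role of the sampling distribution concrete, here is a minimal sketch of the generic sampled softmax loss with the usual logit correction, in which each logit is reduced by the log-probability of drawing that class from the proposal. It does not implement the MIDX Sampler itself; the function and argument names are illustrative assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sampled_softmax_loss(pos_logit, pos_log_q, neg_logits, neg_log_qs):
    """Sampled softmax loss for one example.

    pos_logit: model score of the true class.
    pos_log_q: log proposal probability of the true class.
    neg_logits: model scores of the sampled negative classes.
    neg_log_qs: log proposal probabilities of those negatives.
    """
    # Correct every logit by the log-probability of sampling that class.
    corrected = [pos_logit - pos_log_q]
    corrected += [logit - log_q for logit, log_q in zip(neg_logits, neg_log_qs)]
    # Cross entropy over the true class plus the sampled negatives.
    return logsumexp(corrected) - corrected[0]
```

With equal scores and a uniform proposal over one positive and three negatives, the loss reduces to log 4, the cross entropy of a uniform guess over four candidates.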

Authors:Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang
Title: Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework
Abstract:
Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we revisit mixture models for generating multimodal agent behaviors, which can cover the mainstream methods including continuous mixture models and GPT-like discrete models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the unified mixture model (UniMM) framework, we recognize critical configurations from both model and data perspectives. We conduct a systematic examination of various model configurations, including positive component matching, continuous regression, prediction horizon, and the number of components. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.
中文: 本研究提出了一种统一混合模型框架,通过闭环采样生成自动驾驶仿真中的多模态多智能体行为,有效解决了行为多样性和分布偏移问题,并在WOSAC基准测试中取得了最优性能。
English: This study introduces a unified mixture model framework for generating realistic multi-agent behaviors in autonomous driving simulations, addressing multimodality and distributional shifts through closed-loop sampling and achieving state-of-the-art results on the WOSAC benchmark.

Authors:Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Title: Improving Video Generation with Human Feedback
Abstract:
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
中文摘要:本研究通过构建大规模人类偏好数据集、开发VideoReward奖励模型,并实施三种对齐算法,建立了一个利用人类反馈的系统化流程,显著提升了视频生成的流畅度和提示词对齐效果。
English Summary: This study introduces a systematic pipeline using human feedback to enhance video generation by creating a large-scale preference dataset, developing the VideoReward model, and implementing three alignment algorithms that significantly improve motion smoothness and prompt alignment.

Authors:Jiwen Yu, Yiran Qin, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu
Title: GameFactory: Creating New Games with Generative Interactive Videos
Abstract:
Generative videos have the potential to revolutionize game development by autonomously creating new content. In this paper, we present GameFactory, a framework for action-controlled scene-generalizable game video generation. We first address the fundamental challenge of action controllability by introducing GF-Minecraft, an action-annotated game video dataset without human bias, and developing an action control module that enables precise control over both keyboard and mouse inputs. We further extend to support autoregressive generation for unlimited-length interactive videos. More importantly, GameFactory tackles the critical challenge of scene-generalizable action control, which most existing methods fail to address. To enable the creation of entirely new and diverse games beyond fixed styles and scenes, we leverage the open-domain generative priors from pre-trained video diffusion models. To bridge the domain gap between open-domain priors and small-scale game datasets, we propose a multi-phase training strategy with a domain adapter that decouples game style learning from action control. This decoupling ensures that action control learning is no longer bound to specific game styles, thereby achieving scene-generalizable action control. Experimental results demonstrate that GameFactory effectively generates open-domain action-controllable game videos, representing a significant step forward in AI-driven game generation.
中文摘要:GameFactory通过利用开放域生成先验和多阶段训练策略,提出了一个能实现动作可控和场景泛化的游戏视频生成框架,突破了固定游戏风格限制,可创建无限长度的交互式内容。
English Summary: GameFactory introduces a framework for generating action-controllable and scene-generalizable game videos by leveraging open-domain priors and a multi-phase training strategy, enabling unlimited-length interactive content creation beyond fixed game styles.

Authors:Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai
Title: ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Abstract:
Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges for this task: 1) the identity decoupling issue, where directly adopting existing customization methods inevitably mixes identity attributes when handling multiple concepts simultaneously, and 2) the scarcity of high-quality video-entity pairs, which are crucial for training a model that can represent and decouple various customized concepts well in video generation. To address these challenges, we introduce ConceptMaster, a novel framework that effectively addresses the identity decoupling issue while maintaining concept fidelity in video customization. Specifically, we propose to learn decoupled multi-concept embeddings and inject them into diffusion models in a standalone manner, which effectively guarantees the quality of customized videos with multiple identities, even for highly similar visual concepts. To overcome the scarcity of high-quality MCVC data, we establish a data construction pipeline, which enables collection of high-quality multi-concept video-entity data pairs across diverse scenarios. A multi-concept video evaluation set is further devised to comprehensively validate our method along three dimensions, namely concept fidelity, identity decoupling ability, and video generation quality, across six different concept composition scenarios. Extensive experiments demonstrate that ConceptMaster significantly outperforms previous methods for video customization tasks, showing great potential to generate personalized and semantically accurate content for video diffusion models.
中文摘要:ConceptMaster通过解耦多概念嵌入学习和构建高质量视频实体对,有效解决了多概念视频定制中的身份解耦问题和数据稀缺问题,显著优于现有方法。
English Summary: ConceptMaster is a novel framework that overcomes identity decoupling issues and data scarcity in multi-concept video customization by learning decoupled embeddings and constructing high-quality video-entity pairs, significantly outperforming previous methods.

Authors:Xuan Yu, Yuxuan Xie, Yili Liu, Haojian Lu, Rong Xiong, Yiyi Liao, Yue Wang
Title: Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction
Abstract:
Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a use case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/
中文:PanopticRecon++ 提出了一种端到端的全景重建方法,通过可学习的3D高斯模型作为实例查询,结合空间先验和交叉注意力机制,有效提升了机器人技术和仿真中的场景理解能力。
English: PanopticRecon++ introduces an end-to-end method for panoptic reconstruction using learnable 3D Gaussians as instance queries, which integrates spatial priors and cross-attention to enhance scene understanding for robotics and simulation.

Authors:Jing Zhang, Yanjun Lyu, Xiaowei Yu, Lu Zhang, Chao Cao, Tong Chen, Minheng Chen, Yan Zhuang, Tianming Liu, Dajiang Zhu
Title: Classification of Mild Cognitive Impairment Based on Dynamic Functional Connectivity Using Spatio-Temporal Transformer
Abstract:
Dynamic functional connectivity (dFC) using resting-state functional magnetic resonance imaging (rs-fMRI) is an advanced technique for capturing the dynamic changes of neural activities, and can be very useful in the studies of brain diseases such as Alzheimer's disease (AD). Yet, existing studies have not fully leveraged the sequential information embedded within dFC that can potentially provide valuable information when identifying brain conditions. In this paper, we propose a novel framework that jointly learns the embedding of both spatial and temporal information within dFC based on the transformer architecture. Specifically, we first construct dFC networks from rs-fMRI data through a sliding window strategy. Then, we simultaneously employ a temporal block and a spatial block to capture higher-order representations of dynamic spatio-temporal dependencies, via mapping them into an efficient fused feature representation. To further enhance the robustness of these feature representations by reducing the dependency on labeled data, we also introduce a contrastive learning strategy to manipulate different brain states. Experimental results on 345 subjects with 570 scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) demonstrate the superiority of our proposed method for MCI (Mild Cognitive Impairment, the prodromal stage of AD) prediction, highlighting its potential for early identification of AD.
中文:本文提出了一种基于Transformer架构的新颖框架,通过联合学习动态功能连接的时空信息嵌入,结合对比学习策略,在ADNI数据集上验证了该方法在轻度认知障碍预测方面的优越性,为阿尔茨海默病的早期识别提供了有效方案。
English: This paper introduces a novel transformer-based framework that jointly captures spatial and temporal information from dynamic functional connectivity (dFC) for enhanced Mild Cognitive Impairment (MCI) prediction, demonstrating superior performance in early Alzheimer's disease identification through experiments on ADNI data.
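
The sliding window construction of dFC networks mentioned in the abstract can be sketched as follows: for each window over the time axis, compute the pairwise Pearson correlation between region-of-interest (ROI) signals. The window length, step size, and function names are assumptions for illustration; the abstract does not state the values used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def sliding_window_fc(roi_series, window, step):
    """Build one correlation matrix per window.

    roi_series: list of ROI time series, all of the same length.
    window, step: hypothetical window length and stride, in time points.
    Returns a list of square matrices (lists of lists), one per window.
    """
    n_time = len(roi_series[0])
    matrices = []
    for start in range(0, n_time - window + 1, step):
        segments = [series[start:start + window] for series in roi_series]
        matrices.append([[pearson(a, b) for b in segments] for a in segments])
    return matrices
```

The resulting sequence of matrices is the kind of input that a temporal block (across windows) and a spatial block (across ROIs) could then consume.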

Authors:Jing Zhang, Xiaowei Yu, Yanjun Lyu, Lu Zhang, Tong Chen, Chao Cao, Yan Zhuang, Minheng Chen, Tianming Liu, Dajiang Zhu
Title: Brain-Adapter: Enhancing Neurological Disorder Analysis with Adapter-Tuning Multimodal Large Language Models
Abstract:
Understanding brain disorders is crucial for accurate clinical diagnosis and treatment. Recent advances in Multimodal Large Language Models (MLLMs) offer a promising approach to interpreting medical images with the support of text descriptions. However, previous research has primarily focused on 2D medical images, leaving richer spatial information of 3D images under-explored, and single-modality-based methods are limited by overlooking the critical clinical information contained in other modalities. To address this issue, this paper proposes Brain-Adapter, a novel approach that incorporates an extra bottleneck layer to learn new knowledge and instill it into the original pre-trained knowledge. The major idea is to incorporate a lightweight bottleneck layer to train fewer parameters while capturing essential information and utilize a Contrastive Language-Image Pre-training (CLIP) strategy to align multimodal data within a unified representation space. Extensive experiments demonstrated the effectiveness of our approach in integrating multimodal data to significantly improve the diagnosis accuracy without high computational costs, highlighting the potential to enhance real-world diagnostic workflows.
中文摘要:本文提出Brain-Adapter方法,通过轻量级瓶颈层和对比学习策略整合多模态3D医疗数据,在保持计算效率的同时显著提升了脑部疾病的诊断准确性。
English Summary: This paper introduces Brain-Adapter, a novel method using a lightweight bottleneck layer and CLIP strategy to integrate multimodal 3D medical imaging with text data, significantly improving diagnostic accuracy while maintaining computational efficiency.

Authors:Jialun Cao, Yaojie Lu, Meiziniu Li, Haoyang Ma, Haokun Li, Mengda He, Cheng Wen, Le Sun, Hongyu Zhang, Shengchao Qin, Shing-Chi Cheung, Cong Tian
Title: From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs
Abstract:
The research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO and have made significant progress. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) by distilling GPT-4o and evaluated against ten open-source LLMs, including the recently popular DeepSeek-R1. We also fine-tuned several small 7B to 8B models to achieve performance comparable with DeepSeek-R1-671B. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released at https://huggingface.co/fm-universe.
中文: 本研究通过构建五种形式化语言的1.8万条指令-响应对,推进了基于AI的形式数学推理,证明微调小模型不仅能媲美超大模型性能,还能提升数学推理与编程能力。
English: This study advances AI-based formal mathematical reasoning by creating 18,000 instruction-response pairs across five formal languages and demonstrates that fine-tuning smaller models can match the performance of much larger ones while also improving broader mathematical and coding skills.

Authors:Sankalp KJ, Ashutosh Kumar, Laxmaan Balaji, Nikunj Kotecha, Vinija Jain, Aman Chadha, Sreyoshi Bhaduri
Title: IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
Abstract:
Spoken by more than 1.5 billion people in the Indian subcontinent, Indic languages present unique challenges and opportunities for natural language processing (NLP) research due to their rich cultural heritage, linguistic diversity, and complex structures. IndicMMLU-Pro is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across Indic languages, building upon the MMLU Pro (Massive Multitask Language Understanding) framework. Covering major languages such as Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu, our benchmark addresses the unique challenges and opportunities presented by the linguistic diversity of the Indian subcontinent. This benchmark encompasses a wide range of tasks in language comprehension, reasoning, and generation, meticulously crafted to capture the intricacies of Indian languages. IndicMMLU-Pro provides a standardized evaluation framework to push the research boundaries in Indic language AI, facilitating the development of more accurate, efficient, and culturally sensitive models. This paper outlines the benchmark's design principles, task taxonomy, and data collection methodology, and presents baseline results from state-of-the-art multilingual models.
中文: IndicMMLU-Pro 是一个全面的基准测试,用于评估跨主要印度语言的大型语言模型,应对其语言多样性和复杂结构,以推动具有文化敏感性的AI模型研究。
English: IndicMMLU-Pro is a comprehensive benchmark for evaluating Large Language Models across major Indic languages, addressing their linguistic diversity and complex structures to advance AI research with culturally sensitive models.

Authors:Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Hao Wu, Shu-Tao Xia
Title: Retrievals Can Be Detrimental: A Contrastive Backdoor Attack Paradigm on Retrieval-Augmented Diffusion Models
Abstract:
Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To address these issues, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate that the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.
中文:研究者提出BadRDM,一种针对检索增强扩散模型的后门攻击方法,通过对比学习和生成策略操纵检索项,在保持模型正常功能的同时实现高效攻击。
English: Researchers propose BadRDM, a backdoor attack method for retrieval-augmented diffusion models that manipulates retrieved items through contrastive learning and generative strategies, achieving effective attacks without compromising normal functionality.

Authors:Haokun Zhao, Jinyi Han, Jiaqing Liang, Yanghua Xiao, Xiaojun Meng, Jiansheng Wei
Title: CDS: Knowledge Component-Driven Data Synthesis Guided by Cognitive Diagnosis Theory
Abstract:
Large Language Models (LLMs) have achieved significant advancements, but the increasing complexity of tasks and higher performance demands highlight the need for continuous improvement. Some approaches utilize synthetic data generated by advanced LLMs based on evaluation results to train models. However, conventional evaluation methods fail to provide detailed, fine-grained profiles of LLMs, limiting their guidance for data synthesis. In this paper, we introduce the Cognitive Diagnostic Synthesis (CDS) method, which incorporates a diagnostic process inspired by Cognitive Diagnosis Theory (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve the quality and diversity of synthesized data. Our experiments with several open-source models show significant improvements across multiple benchmarks, achieving up to 6.00% improvement in code generation, 13.10% in mathematical reasoning, and 5.43% in academic exams. Code and data are available on GitHub.
中文摘要:认知诊断合成(CDS)方法通过细粒度诊断和针对性数据合成策略,显著提升了大型语言模型在多项基准测试中的性能表现。
English Summary: The Cognitive Diagnostic Synthesis (CDS) method enhances LLM evaluation by providing fine-grained diagnostic profiles and targeted data synthesis strategies, achieving significant performance improvements across various benchmarks.

Authors:Amitava Das, Suranjana Trivedy, Danush Khanna, Rajarshi Roy, Gurpreet Singh, Basab Ghosh, Yaswanth Narsupalli, Vinija Jain, Vasu Sharma, Aishwarya Naresh Reganti, Aman Chadha
Title: DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Abstract:
The rapid rise of large language models (LLMs) has unlocked many applications but also underscores the challenge of aligning them with diverse values and preferences. Direct Preference Optimization (DPO) is central to alignment but constrained by fixed divergences and limited feature transformations. We propose DPO-Kernels, which integrates kernel methods to address these issues through four key contributions: (i) Kernelized Representations with polynomial, RBF, Mahalanobis, and spectral kernels for richer transformations, plus a hybrid loss combining embedding-based and probability-based objectives; (ii) Divergence Alternatives (Jensen-Shannon, Hellinger, Renyi, Bhattacharyya, Wasserstein, and f-divergences) for greater stability; (iii) Data-Driven Selection metrics that automatically choose the best kernel-divergence pair; and (iv) a Hierarchical Mixture of Kernels for both local precision and global modeling. Evaluations on 12 datasets demonstrate state-of-the-art performance in factuality, safety, reasoning, and instruction following. Grounded in Heavy-Tailed Self-Regularization, DPO-Kernels maintains robust generalization for LLMs, offering a comprehensive resource for further alignment research.
中文摘要:DPO-Kernels通过整合核方法提供更丰富的特征变换和多样化散度度量,显著提升大语言模型的校准效果,在多项基准测试中实现最优性能并保持强泛化能力。
English Summary: DPO-Kernels enhances LLM alignment by integrating kernel methods for richer feature transformations and diverse divergence metrics, achieving superior performance across multiple benchmarks while ensuring robust generalization.
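
As background for the kernel and divergence extensions the abstract lists, here is a minimal sketch of the standard DPO loss for a single preference pair. It covers only the baseline objective, not the DPO-Kernels variants; the argument names are illustrative, and the default `beta` is an assumption.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a response under the
    policy being trained or under the frozen reference model.
    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one.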

Authors:Yiming Huang, Beilei Cui, Long Bai, Zhen Chen, Jinlin Wu, Zhen Li, Hongbin Liu, Hongliang Ren
Title: Advancing Dense Endoscopic Reconstruction with Gaussian Splatting-driven Surface Normal-aware Tracking and Mapping
Abstract:
Simultaneous Localization and Mapping (SLAM) is essential for precise surgical interventions and robotic tasks in minimally invasive procedures. While recent advancements in 3D Gaussian Splatting (3DGS) have improved SLAM with high-quality novel view synthesis and fast rendering, these systems struggle with accurate depth and surface reconstruction due to multi-view inconsistencies. Simply incorporating SLAM and 3DGS leads to mismatches between the reconstructed frames. In this work, we present Endo-2DTAM, a real-time endoscopic SLAM system with 2D Gaussian Splatting (2DGS) to address these challenges. Endo-2DTAM incorporates a surface normal-aware pipeline, which consists of tracking, mapping, and bundle adjustment modules for geometrically accurate reconstruction. Our robust tracking module combines point-to-point and point-to-plane distance metrics, while the mapping module utilizes normal consistency and depth distortion to enhance surface reconstruction quality. We also introduce a pose-consistent strategy for efficient and geometrically coherent keyframe sampling. Extensive experiments on public endoscopic datasets demonstrate that Endo-2DTAM achieves an RMSE of $1.87\pm 0.63$ mm for depth reconstruction of surgical scenes while maintaining computationally efficient tracking, high-quality visual appearance, and real-time rendering. Our code will be released at github.com/lastbasket/Endo-2DTAM.
中文: Endo-2DTAM 提出了一种基于二维高斯泼溅的实时内窥镜SLAM系统,通过表面法向感知模块实现了手术场景的精确深度重建和高质量实时渲染。
English: Endo-2DTAM introduces a real-time endoscopic SLAM system using 2D Gaussian Splatting with surface normal-aware modules, achieving precise depth reconstruction and high-quality rendering in surgical scenes.
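
The two distance metrics that the tracking module is said to combine can be sketched as below. The point-to-point term measures the Euclidean distance between matched points, while the point-to-plane term measures the offset along the target's surface normal. The weighted sum and its weights are assumptions for illustration; the abstract does not say how the two terms are combined.

```python
import math


def point_to_point(source, target):
    """Euclidean distance between two matched 3D points."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(source, target)))


def point_to_plane(source, target, normal):
    """Distance from `source` to the plane through `target` with unit `normal`."""
    return abs(sum((s - t) * n for s, t, n in zip(source, target, normal)))


def combined_residual(source, target, normal, w_point=0.5, w_plane=0.5):
    """Hypothetical weighted sum of the two metrics (weights are assumed)."""
    return (w_point * point_to_point(source, target)
            + w_plane * point_to_plane(source, target, normal))
```

The point-to-plane term ignores sliding along the surface, which is why it is often paired with a point-to-point term when surface normals are available.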

Authors:Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, Xuelong Li
Title: SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model
Abstract:
In this paper, we claim that spatial understanding is the key point in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating learning generalizable and transferable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 million real-world robot episodes, to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and real-world robots demonstrate its advantage of inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture robot-specific spatial action movements of new setups. The superior results from extensive evaluations demonstrate the exceptional in-distribution generalization and out-of-distribution adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All the details and code will be open-sourced.
中文摘要:本文提出SpatialVLA机器人基础模型,通过引入Ego3D位置编码和自适应动作网格来增强空间理解能力,在仿真与真实机器人实验中均展现出卓越的跨机器人控制与泛化性能。
English Summary: This paper introduces SpatialVLA, a robot foundation model that enhances spatial understanding through Ego3D Position Encoding and Adaptive Action Grids, achieving superior cross-robot control and generalization in both simulation and real-world applications.

Authors:Guankun Wang, Long Bai, Junyi Wang, Kun Yuan, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Zhen Lei, Hongbin Liu, Jiazheng Wang, Fan Zhang, Nicolas Padoy, Nassir Navab, Hongliang Ren
Title: EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have demonstrated their immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there is still a lack of MLLMs specialized for surgical scene understanding in clinical applications. In this work, we introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding that surgeons encounter. To train our EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on collected large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, most of whom provide positive feedback on collaborating with EndoChat. Overall, these results demonstrate that our EndoChat has great potential to significantly advance training and automation in robotic-assisted surgery.
中文: EndoChat作为专用于手术场景理解的多模态大语言模型,通过Surg-396K数据集和创新的视觉机制,在多种对话范式与任务中实现最优性能,展现出推动机器人辅助手术培训与自动化的巨大潜力。
English: EndoChat, a specialized MLLM for surgical scene understanding, achieves state-of-the-art performance across multiple dialogue paradigms and tasks by leveraging the Surg-396K dataset and novel visual mechanisms, demonstrating significant potential for advancing robotic-assisted surgery training and automation.

Authors:Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Zhen Zhuang
Title: Neural Algorithmic Reasoning for Hypergraphs with Looped Transformers
Abstract:
Looped Transformers have shown exceptional neural algorithmic reasoning capability in simulating traditional graph algorithms, but their application to more complex structures like hypergraphs remains underexplored. Hypergraphs generalize graphs by modeling higher-order relationships among multiple entities, enabling richer representations but introducing significant computational challenges. In this work, we extend the Loop Transformer architecture's neural algorithmic reasoning capability to simulate hypergraph algorithms, addressing the gap between neural networks and combinatorial optimization over hypergraphs. Specifically, we propose a novel degradation mechanism for reducing hypergraphs to graph representations, enabling the simulation of graph-based algorithms, such as Dijkstra's shortest path. Furthermore, we introduce a hyperedge-aware encoding scheme to simulate hypergraph-specific algorithms, exemplified by Helly's algorithm. We establish theoretical guarantees for these simulations, demonstrating the feasibility of processing high-dimensional and combinatorial data using Loop Transformers. This work highlights the potential of Transformers as general-purpose algorithmic solvers for structured data.
Chinese: 本研究通过降维机制和超边感知编码,将循环Transformer扩展至超图算法模拟,填补了神经网络与超图组合优化之间的空白,并提供了理论保证。
English: This study extends Loop Transformers to simulate hypergraph algorithms through a degradation mechanism and hyperedge-aware encoding, bridging neural networks with hypergraph-based combinatorial optimization while providing theoretical guarantees.

Authors:Yuefan Cao, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Title: RichSpace: Enriching Text-to-Video Prompt Space via Text Embedding Interpolation
Abstract:
Text-to-video generation models have made impressive progress, but they still struggle with generating videos with complex features. This limitation often arises from the inability of the text encoder to produce accurate embeddings, which hinders the video generation model. In this work, we propose a novel approach to overcome this challenge by selecting the optimal text embedding through interpolation in the embedding space. We demonstrate that this method enables the video generation model to produce the desired videos. Additionally, we introduce a simple algorithm using perpendicular foot embeddings and cosine similarity to identify the optimal interpolation embedding. Our findings highlight the importance of accurate text embeddings and offer a pathway for improving text-to-video generation performance.
中文: 本研究提出了一种在文本嵌入空间中进行插值的新方法,通过选择最优嵌入来改进文本到视频生成,使模型能更有效地生成具有复杂特征的视频。
English: This study introduces a novel interpolation method in the text embedding space to enhance text-to-video generation by selecting optimal embeddings, thereby enabling models to produce videos with complex features more effectively.
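
As a rough geometric illustration of the ingredients the abstract names (a perpendicular foot and cosine similarity), the sketch below projects a target embedding onto the line through two prompt embeddings and picks the interpolation step closest to that foot. This is one plausible reading, not the authors' algorithm: the selection rule, the function names, and the toy 2-D vectors are all assumptions.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def perpendicular_foot(a, b, t):
    """Foot of the perpendicular from t onto the line through a and b (a != b)."""
    d = [y - x for x, y in zip(a, b)]
    s = dot([ti - ai for ti, ai in zip(t, a)], d) / dot(d, d)
    return [ai + s * di for ai, di in zip(a, d)]

def best_interpolation(a, b, t, steps=5):
    """Assumed rule: choose the interpolation weight whose embedding is most
    cosine-similar to the perpendicular foot. Requires steps >= 2."""
    foot = perpendicular_foot(a, b, t)
    best_alpha, best_sim = 0.0, -2.0
    for i in range(steps):
        alpha = i / (steps - 1)
        cand = [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
        sim = cosine(cand, foot)
        if sim > best_sim:
            best_alpha, best_sim = alpha, sim
    return best_alpha, best_sim

# Toy 2-D "embeddings" for illustration; real text embeddings are high-dimensional.
alpha, sim = best_interpolation([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
print(alpha, sim)
```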

Authors:Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Wei Wang, Jiahao Zhang
Title: On the Computational Capability of Graph Neural Networks: A Circuit Complexity Bound Perspective
Abstract:
Graph Neural Networks (GNNs) have become the standard approach for learning and reasoning over relational data, leveraging the message-passing mechanism that iteratively propagates node embeddings through graph structures. While GNNs have achieved significant empirical success, their theoretical limitations remain an active area of research. Existing studies primarily focus on characterizing GNN expressiveness through Weisfeiler-Lehman (WL) graph isomorphism tests. In this paper, we take a fundamentally different approach by exploring the computational limitations of GNNs through the lens of circuit complexity. Specifically, we analyze the circuit complexity of common GNN architectures and prove that under constraints of constant-depth layers, linear or sublinear embedding sizes, and polynomial precision, GNNs cannot solve key problems such as graph connectivity and graph isomorphism unless $\mathsf{TC}^0 = \mathsf{NC}^1$. These results reveal the intrinsic expressivity limitations of GNNs behind their empirical success and introduce a novel framework for analyzing GNN expressiveness that can be extended to a broader range of GNN models and graph decision problems.
Chinese: 本文通过电路复杂度分析图神经网络的计算局限性,证明在特定约束条件下除非复杂度类相等,否则GNN无法解决图连通性和同构等关键问题,从而揭示了其内在表达能力边界。
English: This paper analyzes the computational limitations of Graph Neural Networks (GNNs) through circuit complexity, demonstrating that under specific constraints GNNs cannot solve fundamental graph problems unless certain complexity classes are equal, thereby revealing intrinsic expressivity boundaries.

Authors:Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Title: On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
Abstract:
Recently, Visual Autoregressive ($\mathsf{VAR}$) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine ``next-scale prediction'' paradigm. Suppose that $n$ represents the height and width of the last VQ code map generated by $\mathsf{VAR}$ models, the state-of-the-art algorithm in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes $O(n^{4+o(1)})$ time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of $\mathsf{VAR}$ Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which $\mathsf{VAR}$ computations can achieve sub-quadratic time complexity. We have proved that assuming the Strong Exponential Time Hypothesis ($\mathsf{SETH}$) from fine-grained complexity theory, a sub-quartic time algorithm for $\mathsf{VAR}$ models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the $\mathsf{VAR}$ model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in $\mathsf{VAR}$ frameworks.
中文: 本研究基于强指数时间假说证明了视觉自回归模型无法实现次四次时间复杂度的算法,同时提出了利用低秩逼近的高效构造方法来提升计算效率。
English: This study establishes that sub-quartic time complexity for Visual Autoregressive (VAR) Models is impossible under the Strong Exponential Time Hypothesis, while proposing efficient low-rank approximations to enhance computational efficiency.

Authors:Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Title: Circuit Complexity Bounds for Visual Autoregressive Model
Abstract:
Understanding the expressive ability of a specific model is essential for grasping its capacity limitations. Recently, several studies have established circuit complexity bounds for Transformer architecture. Besides, the Visual AutoRegressive (VAR) model has risen to be a prominent method in the field of image generation, outperforming previous techniques, such as Diffusion Transformers, in generating high-quality images. We investigate the circuit complexity of the VAR model and establish a bound in this study. Our primary result demonstrates that the VAR model is equivalent to a simulation by a uniform $\mathsf{TC}^0$ threshold circuit with hidden dimension $d \leq O(n)$ and $\mathrm{poly}(n)$ precision. This is the first study to rigorously highlight the limitations in the expressive power of VAR models despite their impressive performance. We believe our findings will offer valuable insights into the inherent constraints of these models and guide the development of more efficient and expressive architectures in the future.
Chinese Summary: 本研究首次严谨证明,尽管视觉自回归模型在图像生成中表现卓越,但其电路复杂度受限于特定条件下的均匀$\mathsf{TC}^0$阈值电路,揭示了该模型表达能力的根本局限性。
English Summary: This study establishes that the Visual AutoRegressive (VAR) model, despite its superior image generation performance, has circuit complexity bounded by uniform $\mathsf{TC}^0$ with specific constraints, revealing its fundamental expressive limitations.

Authors:Lirong Wu, Haitao Lin, Yufei Huang, Zhangyang Gao, Cheng Tan, Yunfan Liu, Tailin Wu, Stan Z. Li
Title: Relation-Aware Equivariant Graph Networks for Epitope-Unknown Antibody Design and Specificity Optimization
Abstract:
Antibodies are Y-shaped proteins that protect the host by binding to specific antigens, and their binding is mainly determined by the Complementarity-Determining Regions (CDRs) in the antibody. Despite the great progress made in CDR design, existing computational methods still encounter several challenges: 1) poor capability of modeling complex CDRs with long sequences due to insufficient contextual information; 2) reliance on pre-given antigenic epitopes and their static interaction with the target antibody; 3) neglect of specificity during antibody optimization, which leads to non-specific antibodies. In this paper, we take into account a variety of node features, edge features, and edge relations to include more contextual and geometric information. We propose a novel Relation-Aware Antibody Design (RAAD) framework, which dynamically models antigen-antibody interactions for co-designing the sequences and structures of antigen-specific CDRs. Furthermore, we propose a new evaluation metric to better measure antibody specificity and develop a contrasting specificity-enhancing constraint to optimize the specificity of antibodies. Extensive experiments have demonstrated the superior capability of RAAD in terms of antibody modeling, generation, and optimization across different CDR types, sequence lengths, pre-training strategies, and input contexts.
Chinese: 本文提出了一种关系感知抗体设计(RAAD)框架,通过动态建模抗原-抗体相互作用来协同设计抗原特异性CDR的序列和结构,结合上下文和几何特征及新的特异性评估指标,以解决现有计算方法在抗体优化中的局限性。
English: The paper introduces a Relation-Aware Antibody Design (RAAD) framework that dynamically models antigen-antibody interactions to co-design sequences and structures of antigen-specific CDRs, incorporating contextual and geometric features and a new specificity metric to overcome existing computational limitations in antibody optimization.

Authors:Kai Wang, Dongwen Tang, Wangbo Zhao, Konstantin Schürholt, Zhangyang Wang, Yang You
Title: Recurrent Diffusion for Large-Scale Parameter Generation
Abstract:
Parameter generation has long struggled to match the scale of today's large vision and language models, curbing its broader utility. In this paper, we introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters up to hundreds of millions on a single GPU. Our approach first partitions a network's parameters into non-overlapping tokens, each corresponding to a distinct portion of the model. A recurrent mechanism then learns the inter-token relationships, producing prototypes which serve as conditions for a diffusion process that ultimately synthesizes the full parameters. Across a spectrum of architectures and tasks, including ResNets, ConvNeXts, and ViTs on ImageNet-1K and COCO, and even LoRA-based LLMs, RPG achieves performance on par with fully trained networks while avoiding excessive memory overhead. Notably, it generalizes beyond its training set to generate valid parameters for previously unseen tasks, highlighting its flexibility in dynamic and open-ended scenarios. By overcoming the longstanding memory and scalability barriers, RPG serves as a critical advance in AI generating AI, potentially enabling efficient weight generation at scales previously deemed infeasible.
中文: RPG框架采用创新的循环扩散方法,可在单GPU上高效生成完整神经网络参数,不仅实现与训练网络相当的性能,还能泛化至未见任务,成功突破了长期存在的可扩展性瓶颈。
English: The RPG framework introduces a novel recurrent diffusion method that efficiently generates full neural network parameters on a single GPU, achieving performance comparable to trained networks while enabling generalization to unseen tasks and overcoming scalability barriers.

Authors:Yiming Cui, Jiajia Guo, Chao-Kai Wen, Shi Jin, En Tong
Title: Exploring the Potential of Large Language Models for Massive MIMO CSI Feedback
Abstract:
Large language models (LLMs) have achieved remarkable success across a wide range of tasks, particularly in natural language processing and computer vision. This success naturally raises an intriguing yet unexplored question: Can LLMs be harnessed to tackle channel state information (CSI) compression and feedback in massive multiple-input multiple-output (MIMO) systems? Efficient CSI feedback is a critical challenge in next-generation wireless communication. In this paper, we pioneer the use of LLMs for CSI compression, introducing a novel framework that leverages the powerful denoising capabilities of LLMs -- capable of error correction in language tasks -- to enhance CSI reconstruction performance. To effectively adapt LLMs to CSI data, we design customized pre-processing, embedding, and post-processing modules tailored to the unique characteristics of wireless signals. Extensive numerical results demonstrate the promising potential of LLMs in CSI feedback, opening up possibilities for this research direction.
中文: 本文开创性地将大语言模型应用于大规模MIMO系统中的信道状态信息压缩,通过定制化模块适配无线信号特性,利用模型的去噪能力显著提升CSI重构性能。
English: This paper pioneers the use of large language models (LLMs) for channel state information (CSI) compression in massive MIMO systems, introducing a novel framework that leverages LLMs' denoising capabilities to enhance CSI reconstruction through customized modules tailored to wireless signals.

Authors:Jiajia Guo, Yiming Cui, Chao-Kai Wen, Shi Jin
Title: Prompt-Enabled Large AI Models for CSI Feedback
Abstract:
Artificial intelligence (AI) has emerged as a promising tool for channel state information (CSI) feedback. While recent research primarily focuses on improving feedback accuracy on a specific dataset through novel architectures, the underlying mechanism of AI-based CSI feedback remains unclear. This study explores the mechanism by analyzing performance across diverse datasets, with findings suggesting that superior feedback performance stems from AI models' strong fitting capabilities and their ability to leverage environmental knowledge. Building on these findings, we propose a prompt-enabled large AI model (LAM) for CSI feedback. The LAM employs powerful transformer blocks and is trained on extensive datasets from various scenarios. Meanwhile, the channel distribution (environmental knowledge) -- represented as the mean of channel magnitude in the angular-delay domain -- is incorporated as a prompt within the decoder to further enhance reconstruction quality. Simulation results confirm that the proposed prompt-enabled LAM significantly improves feedback accuracy and generalization performance while reducing data collection requirements in new scenarios.
中文: 本研究揭示人工智能CSI反馈性能源于模型强大的拟合能力和环境知识利用,据此提出提示型大型AI模型,在提升精度和泛化能力的同时减少新场景数据需求。
English: This study reveals that AI-based CSI feedback performance relies on models' strong fitting capabilities and environmental knowledge utilization, leading to the development of a prompt-enabled large AI model that enhances accuracy and generalization while reducing data needs.

Authors:Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen
Title: Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model
Abstract:
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance in both information extraction and reasoning tasks across sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between reasoning path length and accuracy, demonstrating the potential of scaling inference for advanced instruction-following and reasoning. This study not only highlights the promise of CoT in enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research.
中文摘要:本研究首次将思维链推理融入大型音频语言模型,显著提升了简单和中等任务的性能,但在复杂任务中发现推理链可能降低准确性,同时揭示了推理路径长度与精度的正相关性。
English Summary: This study pioneers the integration of Chain-of-Thought reasoning into Large Audio-Language Models, demonstrating significant performance gains on easy-to-medium tasks while revealing challenges with complex problems where extended reasoning may reduce accuracy.

Authors:Guosheng Zhang, Keyao Wang, Haixiao Yue, Ajian Liu, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
Title: Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models
Abstract:
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems. Most existing FAS methods are formulated as binary classification tasks, providing confidence scores without interpretation. They exhibit limited generalization in out-of-domain scenarios, such as new environments or unseen spoofing types. In this work, we introduce a multimodal large language model (MLLM) framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS), which transforms the FAS task into an interpretable visual question answering (VQA) paradigm. Specifically, we propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images, enriching the model's supervision with natural language interpretations. To mitigate the impact of noisy captions during training, we develop a Lopsided Language Model (L-LM) loss function that separates loss calculations for judgment and interpretation, prioritizing the optimization of the former. Furthermore, to enhance the model's perception of global visual features, we design a Globally Aware Connector (GAC) to align multi-level visual representations with the language model. Extensive experiments on standard and newly devised One to Eleven cross-domain benchmarks, comprising 12 public datasets, demonstrate that our method significantly outperforms state-of-the-art methods.
中文: 本文提出了一种基于多模态大语言模型的可解释人脸防伪框架,通过将防伪任务转化为视觉问答模式并采用欺骗感知描述策略,在跨域基准测试中显著优于现有最优方法。
English: This paper introduces an interpretable face anti-spoofing framework using a multimodal large language model that transforms the task into visual question answering, achieving superior performance through spoof-aware captioning and global feature alignment across multiple datasets.

Authors:Maya Kruse, Shiyue Hu, Nicholas Derby, Yifu Wu, Samantha Stonbraker, Bingsheng Yao, Dakuo Wang, Elizabeth Goldberg, Yanjun Gao
Title: Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction
Abstract:
Recent advances in large language models (LLMs) have shown potential in clinical text summarization, but their ability to handle long patient trajectories with multi-modal data spread across time remains underexplored. This study systematically evaluates several state-of-the-art open-source LLMs, their Retrieval Augmented Generation (RAG) variants and chain-of-thought (CoT) prompting on long-context clinical summarization and prediction. We examine their ability to synthesize structured and unstructured Electronic Health Records (EHR) data while reasoning over temporal coherence, by re-engineering existing tasks, including discharge summarization and diagnosis prediction from two publicly available EHR datasets. Our results indicate that long context windows improve input integration but do not consistently enhance clinical reasoning, and LLMs are still struggling with temporal progression and rare disease prediction. While RAG shows improvements in hallucination in some cases, it does not fully address these limitations. Our work fills the gap in long clinical text summarization, establishing a foundation for evaluating LLMs with multi-modal data and temporal reasoning.
English Summary: This study evaluates advanced LLMs and their enhanced versions on summarizing long clinical records, finding they still struggle with temporal reasoning and rare diseases despite some improvements with extended context and RAG techniques.

Authors:Dayong Ye, Tianqing Zhu, Shang Wang, Bo Liu, Leo Yu Zhang, Wanlei Zhou, Yang Zhang
Title: Data-Free Model-Related Attacks: Unleashing the Potential of Generative AI
Abstract:
Generative AI technology has become increasingly integrated into our daily lives, offering powerful capabilities to enhance productivity. However, these same capabilities can be exploited by adversaries for malicious purposes. While existing research on adversarial applications of generative AI predominantly focuses on cyberattacks, less attention has been given to attacks targeting deep learning models. In this paper, we introduce the use of generative AI for facilitating model-related attacks, including model extraction, membership inference, and model inversion. Our study reveals that adversaries can launch a variety of model-related attacks against both image and text models in a data-free and black-box manner, achieving comparable performance to baseline methods that have access to the target models' training data and parameters in a white-box manner. This research serves as an important early warning to the community about the potential risks associated with generative AI-powered attacks on deep learning models.
中文: 本文揭示了生成式AI能在无需数据和黑盒条件下对深度学习模型实施多种攻击,其效果媲美白盒方法,为学界敲响了关于此类新型威胁的警钟。
English: This paper demonstrates how generative AI can enable data-free, black-box attacks on deep learning models—including model extraction, membership inference, and inversion—achieving results comparable to white-box methods, serving as a critical warning about these emerging threats.

Authors:Dayong Ye, Tianqing Zhu, Jiayang Li, Kun Gao, Bo Liu, Leo Yu Zhang, Wanlei Zhou, Yang Zhang
Title: Data Duplication: A Novel Multi-Purpose Attack Paradigm in Machine Unlearning
Abstract:
Duplication is a prevalent issue within datasets. Existing research has demonstrated that the presence of duplicated data in training datasets can significantly influence both model performance and data privacy. However, the impact of data duplication on the unlearning process remains largely unexplored. This paper addresses this gap by pioneering a comprehensive investigation into the role of data duplication, not only in standard machine unlearning but also in federated and reinforcement unlearning paradigms. Specifically, we propose an adversary who duplicates a subset of the target model's training set and incorporates it into the training set. After training, the adversary requests the model owner to unlearn this duplicated subset, and analyzes the impact on the unlearned model. For example, the adversary can challenge the model owner by revealing that, despite efforts to unlearn it, the influence of the duplicated subset remains in the model. Moreover, to circumvent detection by de-duplication techniques, we propose three novel near-duplication methods for the adversary, each tailored to a specific unlearning paradigm. We then examine their impacts on the unlearning process when de-duplication techniques are applied. Our findings reveal several crucial insights: 1) the gold standard unlearning method, retraining from scratch, fails to effectively conduct unlearning under certain conditions; 2) unlearning duplicated data can lead to significant model degradation in specific scenarios; and 3) meticulously crafted duplicates can evade detection by de-duplication methods.
Chinese Summary: 本研究开创性地探究了数据重复对多种机器学习遗忘范式的影响,发现重复数据会削弱遗忘效果、规避去重检测并导致模型性能下降。
English Summary: This study pioneers an investigation into how data duplication affects machine unlearning across various paradigms, revealing that duplicated data can undermine unlearning effectiveness, evade detection, and cause model degradation.

Authors:Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
Title: MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Abstract:
We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.
中文: MMVU是一个全面的专家级视频理解基准,包含27个学科的3000个专家标注问题,评估显示即使如o1和Gemini 2.0等顶尖模型仍无法达到人类专家水平。
English: MMVU is a comprehensive expert-level benchmark for evaluating foundation models in video understanding, featuring 3,000 expert-annotated questions across 27 subjects and demonstrating that even top models like o1 and Gemini 2.0 still lag behind human expertise.

Authors:Jing Yao, Xiaoyuan Yi, Shitong Duan, Jindong Wang, Yuzhuo Bai, Muhua Huang, Peng Zhang, Tun Lu, Zhicheng Dou, Maosong Sun, Xing Xie
Title: Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values
Abstract:
As Large Language Models (LLMs) achieve remarkable breakthroughs, aligning their values with humans has become imperative for their responsible development and customized applications. However, evaluations of LLMs' values that fulfill three desirable goals are still lacking. (1) Value Clarification: We expect to clarify the underlying values of LLMs precisely and comprehensively, while current evaluations focus narrowly on safety risks such as bias and toxicity. (2) Evaluation Validity: Existing static, open-source benchmarks are prone to data contamination and quickly become obsolete as LLMs evolve. Additionally, these discriminative evaluations uncover LLMs' knowledge about values, rather than providing valid assessments of LLMs' behavioral conformity to values. (3) Value Pluralism: The pluralistic nature of human values across individuals and cultures is largely ignored in measuring LLMs' value alignment. To address these challenges, we present the Value Compass Benchmarks, with three correspondingly designed modules. It (i) grounds the evaluation on motivationally distinct \textit{basic values} to clarify LLMs' underlying values from a holistic view; (ii) applies a \textit{generative evolving evaluation} framework with adaptive test items for evolving LLMs and direct value recognition from behaviors in realistic scenarios; and (iii) proposes a metric that quantifies LLMs' alignment with a specific value as a weighted sum over multiple dimensions, with weights determined by pluralistic values.
中文: 该摘要指出当前大语言模型价值评估存在三大缺陷——范围局限、方法过时及忽视多元性,并提出"价值罗盘基准"通过整体价值澄清、动态评估框架和多维量化指标来解决这些问题。
English: The abstract identifies three key gaps in evaluating Large Language Models' value alignment—narrow scope, outdated methods, and lack of pluralism—and introduces the Value Compass Benchmarks to address these through holistic value clarification, adaptive evaluation, and multidimensional metrics.
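
The weighted-sum metric in item (iii) can be illustrated with a minimal Python sketch. The function name, the dimension names, and the weight values below are hypothetical and chosen only for illustration; the abstract does not specify how the paper derives or normalizes its weights, so normalizing them to sum to 1 is an assumption.

```python
def alignment_score(dimension_scores, value_weights):
    """Weighted sum of per-dimension scores (weights normalized to sum to 1).

    Both arguments are dicts keyed by value dimension. The normalization
    step is an assumption of this sketch, not taken from the paper.
    """
    total = sum(value_weights[d] for d in dimension_scores)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(
        dimension_scores[d] * value_weights[d] / total
        for d in dimension_scores
    )


# Hypothetical per-dimension scores for one model, in [0, 1].
scores = {"benevolence": 0.8, "security": 0.4}
# Hypothetical weights reflecting one cultural or individual value profile.
weights = {"benevolence": 3.0, "security": 1.0}

print(alignment_score(scores, weights))
```

Changing the weight profile while keeping the scores fixed yields a different alignment value for the same model, which is one way to read the "pluralistic" aspect of the metric.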

Authors:Huiqiang Chen, Tianqing Zhu, Wanlei Zhou, Wei Zhao
Title: AFed: Algorithmic Fair Federated Learning
Abstract:
Federated Learning (FL) has gained significant attention as it facilitates collaborative machine learning among multiple clients without centralizing their data on a server. FL ensures the privacy of participating clients by locally storing their data, which creates new challenges in fairness. Traditional debiasing methods assume centralized access to sensitive information, rendering them impractical for the FL setting. Additionally, FL is more susceptible to fairness issues than centralized machine learning due to the diverse client data sources that may be associated with group information. Therefore, training a fair model in FL without access to client local data is important and challenging. This paper presents AFed, a straightforward yet effective framework for promoting group fairness in FL. The core idea is to circumvent restricted data access by learning the global data distribution. This paper proposes two approaches: AFed-G, which uses a conditional generator trained on the server side, and AFed-GAN, which improves upon AFed-G by training a conditional GAN on the client side. We augment the client data with the generated samples to help remove bias. Our theoretical analysis justifies the proposed methods, and empirical results on multiple real-world datasets demonstrate a substantial improvement in AFed over several baselines.
中文: AFed是一个通过生成模拟全局数据分布的合成数据来提升联邦学习中群体公平性的创新框架,无需访问客户端本地数据即可有效减少偏差。
English: AFed is a novel framework designed to enhance group fairness in Federated Learning by generating synthetic data that approximates the global data distribution, thereby mitigating bias without accessing local client data.

Authors:Shudong Liu, Yiqiao Jin, Cheng Li, Derek F. Wong, Qingsong Wen, Lichao Sun, Haipeng Chen, Xing Xie, Jindong Wang
Title: CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries
Abstract:
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding, often misinterpreting symbols, gestures, and artifacts due to biases in predominantly Western-centric training data. In this paper, we construct CultureVerse, a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types, with the aim of characterizing and improving VLMs' multicultural understanding capabilities. Then, we propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding. Our evaluation of 16 models reveals significant disparities, with a stronger performance in Western concepts and weaker results in African and Asian contexts. Fine-tuning on our CultureVerse enhances cultural perception, demonstrating cross-cultural, cross-continent, and cross-dataset generalization without sacrificing performance on models' general VLM benchmarks. We further present insights on cultural generalization and forgetting. We hope that this work could lay the foundation for more equitable and culturally aware multimodal AI systems.
中文摘要:视觉语言模型因西方中心训练数据常误解文化元素,而提出的CultureVerse基准和CultureVLM模型在保持通用性能的同时显著提升了跨文化理解能力。
English Summary: Vision-language models often misinterpret cultural elements due to Western-biased training data, but the proposed CultureVerse benchmark and CultureVLM models significantly enhance multicultural understanding while maintaining general performance.

Authors:Changchang Yin, Shihan Fu, Bingsheng Yao, Thai-Hoang Pham, Weidan Cao, Dakuo Wang, Jeffrey Caterino, Ping Zhang
Title: SepsisCalc: Integrating Clinical Calculators into Early Sepsis Prediction via Dynamic Temporal Graph Construction
Abstract:
Sepsis is an organ dysfunction caused by a dysregulated immune response to an infection. Early sepsis prediction and identification allow for timely intervention, leading to improved clinical outcomes. Clinical calculators (e.g., the six-organ dysfunction assessment of SOFA) play a vital role in sepsis identification within clinicians' workflow, providing evidence-based risk assessments essential for sepsis diagnosis. However, artificial intelligence (AI) sepsis prediction models typically generate a single sepsis risk score without incorporating clinical calculators for assessing organ dysfunctions, making the models less convincing and transparent to clinicians. To bridge the gap, we propose to mimic clinicians' workflow with a novel framework SepsisCalc to integrate clinical calculators into the predictive model, yielding a clinically transparent and precise model for utilization in clinical settings. Practically, clinical calculators usually combine information from multiple component variables in Electronic Health Records (EHR), and might not be applicable when the variables are (partially) missing. We mitigate this issue by representing EHRs as temporal graphs and integrating a learning module to dynamically add the accurately estimated calculator to the graphs. Experimental results on real-world datasets show that the proposed model outperforms state-of-the-art methods on sepsis prediction tasks. Moreover, we developed a system to identify organ dysfunctions and potential sepsis risks, providing a human-AI interaction tool for deployment, which can help clinicians understand the prediction outputs and prepare timely interventions for the corresponding dysfunctions, paving the way for actionable clinical decision-making support for early intervention.
中文: 提出的SepsisCalc框架将临床计算器整合到AI模型中,通过动态估算缺失数据,在败血症预测中提高了透明度和准确性,并优于现有方法,同时提供人机交互工具以支持可操作的临床决策。
English: The proposed SepsisCalc framework integrates clinical calculators into AI models to enhance transparency and accuracy in sepsis prediction by dynamically estimating missing data and outperforming existing methods, while also providing a human-AI interaction tool for actionable clinical decision support.

Authors:Andrea Lacava, Leonardo Bonati, Niloofar Mohamadi, Rajeev Gangula, Florian Kaltenberger, Pedram Johari, Salvatore D'Oro, Francesca Cuomo, Michele Polese, Tommaso Melodia
Title: dApps: Enabling Real-Time AI-Based Open RAN Control
Abstract:
Open Radio Access Networks (RANs) leverage disaggregated and programmable RAN functions and open interfaces to enable closed-loop, data-driven radio resource management. This is performed through custom intelligent applications on the RAN Intelligent Controllers (RICs), optimizing RAN policy scheduling, network slicing, user session management, and medium access control, among others. In this context, we have proposed dApps as a key extension of the O-RAN architecture into the real-time and user-plane domains. Deployed directly on RAN nodes, dApps access data otherwise unavailable to RICs due to privacy or timing constraints, enabling the execution of control actions within shorter time intervals. In this paper, we propose for the first time a reference architecture for dApps, defining their life cycle from deployment by the Service Management and Orchestration (SMO) to real-time control loop interactions with the RAN nodes where they are hosted. We introduce a new dApp interface, E3, along with an Application Protocol (AP) that supports structured message exchanges and extensible communication for various service models. By bridging E3 with the existing O-RAN E2 interface, we enable dApps, xApps, and rApps to coexist and coordinate. These applications can then collaborate on complex use cases and employ hierarchical control to resolve shared resource conflicts. Finally, we present and open-source a dApp framework based on OpenAirInterface (OAI). We benchmark its performance in two real-time control use cases, i.e., spectrum sharing and positioning in a 5th generation (5G) Next Generation Node B (gNB) scenario. Our experimental results show that standardized real-time control loops via dApps are feasible, achieving average control latency below 450 microseconds and allowing optimal use of shared spectral resources.
中文: 开放式无线接入网络通过解耦和可编程功能及智能控制器实现数据驱动的无线资源管理,其中dApps将这一能力扩展到实时领域,通过提出的包含E3接口的架构支持分层控制,实验证明其在频谱共享场景中可实现低于450微秒的控制延迟。
English: Open RANs utilize disaggregated, programmable functions and intelligent controllers to enable data-driven radio resource management, with dApps extending this capability into real-time domains through a proposed architecture that includes the E3 interface and supports hierarchical control, as demonstrated in experimental setups achieving sub-450 microsecond latency for efficient spectrum sharing.

Authors:Daoyuan Chen, Yilun Huang, Xuchen Pan, Nana Jiang, Haibin Wang, Yilei Zhang, Ce Ge, Yushuo Chen, Wenhao Zhang, Zhijian Ma, Jun Huang, Wei Lin, Yaliang Li, Bolin Ding, Jingren Zhou
Title: Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models
Abstract:
The burgeoning field of foundation models necessitates advanced data processing mechanisms capable of harnessing vast and valuable data with various types used by these models. Nevertheless, the current landscape presents unique challenges that traditional data processing frameworks struggle to handle effectively, particularly in handling the complexity of multimodal data. In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. It contains a new runtime layer optimized for adaptive execution and management across varying dataset scales, processing demands, and computational environments, while hiding unnecessary system details. Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain it and share insights from practical feedback, with the goal of facilitating research and application of next-generation foundation models.
中文: Data-Juicer 2.0 是一个先进的多模态数据处理系统,通过优化架构和丰富算子显著提升了基础模型的数据处理能力,支持在多样化计算环境中高效处理TB级数据。
English: Data-Juicer 2.0 is an advanced multimodal data processing system that enhances usability, efficiency, and scalability for foundation models, capable of handling TB-level data across diverse computational environments.

Authors:Lu Wang, Tianyuan Zhang, Yang Qu, Siyuan Liang, Yuwei Chen, Aishan Liu, Xianglong Liu, Dacheng Tao
Title: Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving
Abstract:
Vision-language models (VLMs) have significantly advanced autonomous driving (AD) by enhancing reasoning capabilities; however, these models remain highly susceptible to adversarial attacks. While existing research has explored white-box attacks to some extent, the more practical and challenging black-box scenarios remain largely underexplored due to their inherent difficulty. In this paper, we take the first step toward designing black-box adversarial attacks specifically targeting VLMs in AD. We identify two key challenges for achieving effective black-box attacks in this context: the effectiveness across driving reasoning chains in AD systems and the dynamic nature of driving scenarios. To address this, we propose Cascading Adversarial Disruption (CAD). It first introduces Decision Chain Disruption, which targets low-level reasoning breakdown by generating and injecting deceptive semantics, ensuring the perturbations remain effective across the entire decision-making chain. Building on this, we present Risky Scene Induction, which addresses dynamic adaptation by leveraging a surrogate VLM to understand and construct high-level risky scenarios that are likely to result in critical errors in the current driving contexts. Extensive experiments conducted on multiple AD VLMs and benchmarks demonstrate that CAD achieves state-of-the-art attack effectiveness, significantly outperforming existing methods (+13.43% on average). Moreover, we validate its practical applicability through real-world attacks on AD vehicles powered by VLMs, where the route completion rate drops by 61.11% and the vehicle crashes directly into the obstacle vehicle with adversarial patches. Finally, we release the CADA dataset, comprising 18,808 adversarial visual-question-answer pairs, to facilitate further evaluation and research in this critical domain. Our code and dataset will be available after the paper's acceptance.
中文摘要:本文提出级联对抗干扰(CAD)方法,针对自动驾驶中的视觉语言模型设计黑盒攻击,通过破坏决策链和诱导风险场景实现最优攻击效果,并经过真实场景验证。
English summary: This paper introduces Cascading Adversarial Disruption (CAD), a novel black-box attack method targeting vision-language models in autonomous driving that disrupts decision chains and induces risky scenarios, achieving state-of-the-art effectiveness with real-world validation.

Authors:Jiayi Zhang, Wenhui Yi, Bokai Xu, Zhe Wang, Huahua Xiao, Bo Ai
Title: ROMA: ROtary and Movable Antenna
Abstract:
The rotary and movable antenna (ROMA) architecture represents a next-generation multi-antenna technology that enables flexible adjustment of antenna position and array rotation angles of the transceiver. In this letter, we propose a ROMA-aided multi-user MIMO communication system to fully enhance the efficiency and reliability of system transmissions. By deploying ROMA panels at both the transmitter and receiver sides, and jointly optimizing the three-dimensional (3D) rotation angles of each ROMA panel and the relative positions of antenna elements based on the spatial distribution of users and channel state information (CSI), we can achieve the objective of maximizing the average spectral efficiency (SE). Subsequently, we conduct a detailed analysis of the average SE performance of the system under the consideration of maximum ratio (MR) precoding. Due to the non-convexity of the optimization problem in the ROMA multi-user MIMO system, we propose an efficient solution based on an alternating optimization (AO) algorithm. Finally, simulation results demonstrate that the AO-based ROMA architecture can significantly improve the average SE. Furthermore, the performance improvement becomes more pronounced as the size of the movable region and the transmission power increase.
中文摘要:ROMA架构通过优化天线位置和旋转角度,采用交替优化算法显著提升多用户MIMO系统的频谱效率,且性能随可移动区域和发射功率增大而增强。
English Summary: The ROMA architecture enhances multi-user MIMO systems by optimizing antenna positions and rotation angles, significantly improving spectral efficiency through an alternating optimization algorithm.
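
The alternating optimization (AO) idea mentioned in the abstract can be sketched in Python on a toy problem. This is not the paper's algorithm: the objective below is a hypothetical two-variable concave function standing in for average spectral efficiency, and the scalar variables `x` and `y` stand in for the two blocks of variables (rotation angles and antenna positions). The sketch only shows the general pattern of optimizing one block while holding the other fixed.

```python
def objective(x, y):
    # Toy concave objective (an assumption of this sketch, not the SE formula).
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2 - x * y


def alternating_optimization(x=0.0, y=0.0, tol=1e-14, max_iters=100):
    """Maximize objective(x, y) by alternating exact block updates."""
    prev = objective(x, y)
    for _ in range(max_iters):
        # Block 1: best x with y fixed (set d/dx to zero).
        x = 1.0 - y / 2.0
        # Block 2: best y with x fixed (set d/dy to zero).
        y = -2.0 - x / 2.0
        cur = objective(x, y)
        # Stop once the objective no longer improves noticeably.
        if cur - prev < tol:
            break
        prev = cur
    return x, y


x_opt, y_opt = alternating_optimization()
print(x_opt, y_opt)
```

For this toy function the iterates approach the joint maximizer at x = 8/3, y = -10/3. In the non-convex setting described in the abstract, AO is generally only expected to reach a stationary point, so the result can depend on initialization.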

Authors:Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang
Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Abstract:
Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end-to-end film automation in our constructed 3D virtual spaces. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers, and covers key stages of a film production workflow: (1) idea development transforms brainstormed ideas into structured story outlines; (2) scriptwriting elaborates on dialogue and character actions for each scene; (3) cinematography determines the camera setups for each shot. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations. We evaluate the generated videos on 15 ideas and 4 key aspects. Human evaluation shows that FilmAgent outperforms all baselines across all aspects and scores 3.98 out of 5 on average, showing the feasibility of multi-agent collaboration in filmmaking. Further analysis reveals that FilmAgent, despite using the less advanced GPT-4o model, surpasses the single-agent o1, showing the advantage of a well-coordinated multi-agent system. Lastly, we discuss the complementary strengths and weaknesses of OpenAI's text-to-video model Sora and our FilmAgent in filmmaking.
中文: 本文提出FilmAgent,这是一个基于大语言模型的多智能体框架,通过模拟电影制作团队角色在3D虚拟空间中实现端到端自动化制片,并以3.98/5的人类评估分数超越所有基线方法,证明了多智能体协作在电影制作中的有效性。
English: The paper presents FilmAgent, an LLM-based multi-agent framework that automates end-to-end film production in 3D virtual spaces by simulating crew roles and outperforming baselines with a 3.98/5 human evaluation score, demonstrating the effectiveness of multi-agent collaboration.

Authors:Zonglei Jing, Zonghao Ying, Le Wang, Siyuan Liang, Aishan Liu, Xianglong Liu, Dacheng Tao
Title: CogMorph: Cognitive Morphing Attacks for Text-to-Image Models
Abstract:
The development of text-to-image (T2I) generative models, which enable the creation of high-quality synthetic images from textual prompts, has opened new frontiers in creative design and content generation. However, this paper reveals a significant and previously unrecognized ethical risk inherent in this technology and introduces a novel method, termed the Cognitive Morphing Attack (CogMorph), which manipulates T2I models to generate images that retain the original core subjects but embed toxic or harmful contextual elements. This nuanced manipulation exploits the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context, producing images that amplify emotional harm far beyond attacks that merely preserve the original semantics. To address this, we first construct an imagery toxicity taxonomy spanning 10 major and 48 sub-categories, aligned with human cognitive-perceptual dimensions, and further build a toxicity risk matrix resulting in 1,176 high-quality T2I toxic prompts. Based on this, our CogMorph first introduces Cognitive Toxicity Augmentation, which develops a cognitive toxicity knowledge base with rich external toxic representations for humans (e.g., fine-grained visual features) that can be utilized to further guide the optimization of adversarial prompts. In addition, we present Contextual Hierarchical Morphing, which hierarchically extracts critical parts of the original prompt (e.g., scenes, subjects, and body parts), and then iteratively retrieves and fuses toxic features to inject harmful contexts. Extensive experiments on multiple open-sourced T2I models and black-box commercial APIs (e.g., DALLE-3) demonstrate the efficacy of CogMorph, which significantly outperforms other baselines by large margins (+20.62% on average).
中文: 本文提出CogMorph攻击方法,通过构建毒性知识库和分层特征融合技术,在保持图像核心主体的同时嵌入有害语境元素,利用认知原理增强情感伤害,实验证明其攻击效果显著优于现有方法。
English: This paper introduces CogMorph, a novel attack method that manipulates text-to-image models to embed harmful contextual elements while preserving core subjects, exploiting cognitive principles to amplify emotional harm, and demonstrates its superior effectiveness through extensive experiments.

Authors:Feijie Wu, Zitao Li, Fei Wei, Yaliang Li, Bolin Ding, Jing Gao
Title: Talk to Right Specialists: Routing and Planning in Multi-agent System for Question Answering
Abstract:
Leveraging large language models (LLMs), an agent can utilize retrieval-augmented generation (RAG) techniques to integrate external knowledge and increase the reliability of its responses. Current RAG-based agents integrate single, domain-specific knowledge sources, limiting their ability and leading to hallucinated or inaccurate responses when addressing cross-domain queries. Integrating multiple knowledge bases into a unified RAG-based agent raises significant challenges, including increased retrieval overhead and data sovereignty when sensitive data is involved. In this work, we propose RopMura, a novel multi-agent system that addresses these limitations by incorporating highly efficient routing and planning mechanisms. RopMura features two key components: a router that intelligently selects the most relevant agents based on knowledge boundaries and a planner that decomposes complex multi-hop queries into manageable steps, allowing for coordinating cross-domain responses. Experimental results demonstrate that RopMura effectively handles both single-hop and multi-hop queries, with the routing mechanism enabling precise answers for single-hop queries and the combined routing and planning mechanisms achieving accurate, multi-step resolutions for complex queries.
中文: RopMura提出了一种多智能体系统,通过智能路由和规划机制克服单领域RAG代理的局限,有效处理简单和复杂的跨领域查询。
English: RopMura introduces a multi-agent system with intelligent routing and planning mechanisms to overcome the limitations of single-domain RAG agents, enabling efficient handling of both simple and complex cross-domain queries.

Authors:Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, Guanghui Ren
Title: EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
Abstract:
We introduce EnerVerse, a generative robotics foundation model that constructs and interprets embodied spaces. EnerVerse employs an autoregressive video diffusion framework to predict future embodied spaces from instructions, enhanced by a sparse context memory for long-term reasoning. To model the 3D robotics world, we propose Free Anchor Views (FAVs), a multi-view video representation offering flexible, task-adaptive perspectives to address challenges like motion ambiguity and environmental constraints. Additionally, we present EnerVerse-D, a data engine pipeline combining the generative model with 4D Gaussian Splatting, forming a self-reinforcing data loop to reduce the sim-to-real gap. Leveraging these innovations, EnerVerse translates 4D world representations into physical actions via a policy head (EnerVerse-A), enabling robots to execute task instructions. EnerVerse-A achieves state-of-the-art performance in both simulation and real-world settings.
中文: EnerVerse是一种生成式机器人基础模型,通过自回归视频扩散框架和自由锚点视图预测并建模具身空间,在仿真和真实世界任务中均实现了顶尖性能。
English: EnerVerse is a generative robotics foundation model that uses an autoregressive video diffusion framework and Free Anchor Views to predict and model embodied spaces, achieving state-of-the-art performance in both simulated and real-world tasks.

Authors:Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, Min Zhang
Title: KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
Abstract:
As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations on the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.
Chinese: KaLM-Embedding 是一种多语言嵌入模型,通过采用更清洁多样的数据和创新训练技术,在 MTEB 基准测试中超越了同类模型,为参数小于 10 亿的模型设立了新标杆。
English: KaLM-Embedding is a multilingual embedding model that uses cleaner, diverse data and innovative training techniques to outperform comparable models on the MTEB benchmark, setting a new standard for models under 1B parameters.

Authors:Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Shaokai Chen, Mengshu Sun, Binbin Hu, Zhiqiang Zhang, Lei Liang, Wen Zhang, Huajun Chen
Title: Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking
Abstract:
Large language models (LLMs) have demonstrated exceptional performance in text generation within current NLP research. However, the lack of factual accuracy is still a dark cloud hanging over the LLM skyscraper. Structural knowledge prompting (SKP) is a prominent paradigm to integrate external knowledge into LLMs by incorporating structural representations, achieving state-of-the-art results in many knowledge-intensive tasks. However, existing methods often focus on specific problems, lacking a comprehensive exploration of the generalization and capability boundaries of SKP. This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives including Granularity, Transferability, Scalability, and Universality. To provide a thorough evaluation, we introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty.
中文摘要:大型语言模型在文本生成方面表现出色,但事实准确性不足,本文通过构建多粒度基准,从四个维度评估结构知识提示范式的泛化能力。
English Summary: Large language models excel in text generation but struggle with factual accuracy, prompting the use of structural knowledge to enhance them; this paper evaluates the generalization of this approach through a new multi-level benchmark.

Authors:An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, Junyang Lin, Kai Dang, Kexin Yang, Le Yu, Mei Li, Minmin Sun, Qin Zhu, Rui Men, Tao He, Weijia Xu, Wenbiao Yin, Wenyuan Yu, Xiafei Qiu, Xingzhang Ren, Xinlong Yang, Yong Li, Zhiying Xu, Zipeng Zhang
Title: Qwen2.5-1M Technical Report
Abstract:
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.
中文: Qwen2.5-1M系列通过先进的训练技术和开源推理框架将上下文长度扩展至100万标记,在长文本任务中实现显著速度提升并超越竞争对手。
English: The Qwen2.5-1M series extends context length to 1 million tokens through advanced training techniques and an open-source inference framework, achieving significant speed improvements and outperforming competitors in long-context tasks.

Authors:Guobin Shen, Jindong Li, Tenglong Li, Dongcheng Zhao, Yi Zeng
Title: $SpikePack$: Enhanced Information Flow in Spiking Neural Networks with High Hardware Compatibility
Abstract:
Spiking Neural Networks (SNNs) hold promise for energy-efficient, biologically inspired computing. We identify substantial information loss during spike transmission, linked to temporal dependencies in the traditional Leaky Integrate-and-Fire (LIF) neuron, a key factor potentially limiting SNN performance. Existing SNN architectures also underutilize modern GPUs, constrained by single-bit spike storage and isolated weight-spike operations that restrict computational efficiency. We introduce ${SpikePack}$, a neuron model designed to reduce transmission loss while preserving essential features like membrane potential reset and leaky integration. ${SpikePack}$ achieves constant $\mathcal{O}(1)$ time and space complexity, enabling efficient parallel processing on GPUs and also supporting serial inference on existing SNN hardware accelerators. Compatible with standard Artificial Neural Network (ANN) architectures, ${SpikePack}$ facilitates near-lossless ANN-to-SNN conversion across various networks. Experimental results on tasks such as image classification, detection, and segmentation show ${SpikePack}$ achieves significant gains in accuracy and efficiency for both directly trained and converted SNNs over state-of-the-art models. Tests on FPGA-based platforms further confirm cross-platform flexibility, delivering high performance and enhanced sparsity. By enhancing information flow and rethinking SNN-ANN integration, ${SpikePack}$ advances efficient SNN deployment across diverse hardware platforms.
中文: SpikePack是一种新型神经元模型,可减少脉冲神经网络的信息传输损失并保持恒定计算复杂度,支持高效的GPU并行处理和近乎无损的人工神经网络转换,在多种任务和硬件平台上实现了精度与效率的显著提升。
English: SpikePack is a novel neuron model that reduces information loss in Spiking Neural Networks while maintaining constant computational complexity, enabling efficient GPU processing and near-lossless ANN-to-SNN conversion with demonstrated improvements in accuracy and efficiency across multiple tasks and hardware platforms.
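
For context, the baseline Leaky Integrate-and-Fire (LIF) neuron that the abstract contrasts against can be sketched in a few lines of Python. This is a common textbook discrete-time LIF with hard reset, not the SpikePack model itself; the `leak` and `threshold` values and the function name are assumptions made for illustration.

```python
def simulate_lif(inputs, leak=0.5, threshold=1.0):
    """Discrete-time LIF neuron with hard reset; returns a 0/1 spike train."""
    v = 0.0          # membrane potential, carried across time steps
    spikes = []
    for current in inputs:
        # Leaky integration: decay the old potential, add the new input.
        v = leak * v + current
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset discards any potential above threshold
        else:
            spikes.append(0)
    return spikes


spikes = simulate_lif([0.6, 0.6, 0.6, 0.6])
print(spikes)
```

Each step depends on the potential left over from the previous step, which is the temporal dependency the abstract refers to, and the output carries only one bit per step regardless of how far the potential exceeded the threshold.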

Authors:Yuxuan Liang, Xu Li, Xiaolei Chen, Haotian Chen, Yi Zheng, Chenghang Lai, Bin Li, Xiangyang Xue
Title: Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models
Abstract:
As the demand for high-resolution image processing in Large Vision-Language Models (LVLMs) grows, sub-image partitioning has become a popular approach for mitigating visual information loss associated with fixed-resolution processing. However, existing partitioning methods uniformly process sub-images, resulting in suboptimal image understanding. In this work, we reveal that the sub-images with higher semantic relevance to the entire image encapsulate richer visual information for preserving the model's visual understanding ability. Therefore, we propose the Global Semantic-guided Weight Allocator (GSWA) module, which dynamically allocates weights to sub-images based on their relative information density, emulating human visual attention mechanisms. This approach enables the model to focus on more informative regions, overcoming the limitations of uniform treatment. We integrate GSWA into the InternVL2-2B framework to create SleighVL, a lightweight yet high-performing model. Extensive experiments demonstrate that SleighVL outperforms models with comparable parameters and remains competitive with larger models. Our work provides a promising direction for more efficient and contextually aware high-resolution image processing in LVLMs, advancing multimodal system development.
中文: 本文提出的全局语义引导权重分配器(GSWA)模块通过语义关联性动态加权子图像,使SleighVL模型在高分辨率图像处理中优于同类模型并保持竞争力。
English: The Global Semantic-guided Weight Allocator (GSWA) module is introduced to dynamically weight sub-images by semantic relevance, enabling the SleighVL model to outperform comparable models in high-resolution image processing.

Authors:Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue
Title: Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multilayer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. Building on these insights, we propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations demonstrate the superior performance of our method. Additionally, an in-depth analysis of the aggregator's behavior highlights the dominance of mid-to-high-level features in semantic-rich tasks and the critical role of low-level features in fine-grained perception.
中文: 大型视觉语言模型目前对分层视觉特征利用不足,因此本研究提出指令引导的视觉聚合器,能根据文本指令动态整合多层视觉信息,在不增加视觉标记的前提下提升任务特定性能。
English: Large Vision-Language Models currently underutilize hierarchical visual features, so this study introduces an instruction-guided vision aggregator that dynamically integrates multi-layer visual information to enhance task-specific performance without increasing visual tokens.

Authors:Chengzu Li, Wenshan Wu, Huanyu Zhang, Yan Xia, Shaoguang Mao, Li Dong, Ivan Vulić, Furu Wei
Title: Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Abstract:
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Chinese: 提出的多模态思维可视化(MVoT)方法通过生成视觉推理轨迹来增强多模态大语言模型的空间推理能力,在最具挑战性的场景中优于思维链提示,实现了视觉与语言推理的有效互补。
English: The proposed Multimodal Visualization-of-Thought (MVoT) paradigm enhances spatial reasoning in MLLMs by generating visual reasoning traces, outperforming Chain-of-Thought prompting in challenging scenarios through improved visual coherence and fidelity.

Authors:Xiang Xu, Lingdong Kong, Hui Shuai, Liang Pan, Ziwei Liu, Qingshan Liu
Title: LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
Abstract:
LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across eleven large-scale LiDAR datasets demonstrate our effectiveness and superiority. The code has been made publicly accessible.
中文摘要:LiMoE框架通过集成混合专家模式,结合图像到激光雷达的预训练、对比混合学习和语义混合监督,协同融合多种激光雷达表示方法,在十一个大规模数据集上实现了卓越性能。
English Summary: LiMoE integrates the Mixture of Experts paradigm to synergistically combine multiple LiDAR representations through image-to-LiDAR pretraining, contrastive mixture learning, and semantic mixture supervision, achieving superior performance across eleven datasets.

Authors:Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan
Title: Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
Abstract:
Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.
中文: 最新研究提出DriveBench基准,揭示视觉语言模型在自动驾驶决策中常因视觉基础不足和对输入干扰敏感而产生不可靠解释,为此提出了改进的评估指标以提升安全性。
English: Recent research introduces DriveBench, a benchmark revealing that Vision-Language Models often produce unreliable explanations for autonomous driving decisions due to insufficient visual grounding and sensitivity to input corruptions, prompting the proposal of improved evaluation metrics for enhanced safety.

Authors:Yansong Qu, Dian Chen, Xinyang Li, Xiaofan Li, Shengchuan Zhang, Liujuan Cao, Rongrong Ji
Title: Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting
Abstract:
Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically utilize generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail when addressing geometric changes, such as editing a character's head to turn around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. It enables users to conveniently specify the desired editing region and the desired dragging direction through the input of 3D masks and pairs of control points, thereby enabling precise control over the extent of editing. DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at https://quyans.github.io/Drag-Your-Gaussian.
Chinese: 近期基于生成模型的3D场景编辑方法仅限于纹理修改且缺乏几何控制的精确性,为此我们提出了DYG——一种针对3D高斯泼溅的拖拽式编辑方法,通过3D遮罩和控制点对实现精准的空间编辑控制。
English: Recent advancements in 3D scene editing using generative models are limited to texture changes and lack precise geometric control, prompting the introduction of DYG, a drag-based method for 3D Gaussian Splatting that enables accurate spatial manipulation through 3D masks and control points.

Authors:Wei Zou, Shujian Huang, Jiajun Chen
Title: Extend Adversarial Policy Against Neural Machine Translation via Unknown Token
Abstract:
Generating adversarial examples contributes to mainstream neural machine translation (NMT) robustness. However, popular adversarial policies are suited to fixed tokenization, hindering their efficacy for common character perturbations involving versatile tokenization. Based on existing adversarial generation via reinforcement learning (RL), we propose the 'DexChar policy' that introduces character perturbations for the existing mainstream adversarial policy based on token substitution. Furthermore, we improve the self-supervised matching that provides feedback in RL to cater to the semantic constraints required during training adversaries. Experiments show that our method is compatible with the scenario where baseline adversaries fail, and can generate high-efficiency adversarial examples for analysis and optimization of the system.
中文摘要:提出的DexChar策略通过引入字符级扰动和改进强化学习中的自监督匹配机制,有效增强了神经机器翻译的鲁棒性,能够在基线方法失效的场景下生成高效对抗样本。
English Summary: The proposed DexChar policy enhances neural machine translation robustness by introducing character-level perturbations and improved self-supervised matching in reinforcement learning, effectively generating adversarial examples where baseline methods fail.

Authors:Zhongxiang Sun, Qipeng Wang, Weijie Yu, Xiaoxue Zang, Kai Zheng, Jun Xu, Xiao Zhang, Song Yang, Han Li
Title: ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding
Abstract:
Retrieval-Augmented Generation (RAG) systems for Large Language Models (LLMs) hold promise in knowledge-intensive tasks but face limitations in complex multi-step reasoning. While recent methods have integrated RAG with chain-of-thought reasoning or test-time search using Process Reward Models (PRMs), these approaches encounter challenges such as a lack of explanations, bias in PRM training data, early-step bias in PRM scores, and insufficient post-training optimization of reasoning potential. To address these issues, we propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR), a framework that enhances RAG systems' reasoning capabilities through post-training and test-time scaling. At test time, ReARTeR introduces Trustworthy Process Rewarding via a Process Reward Model for accurate scalar scoring and a Process Explanation Model (PEM) for generating natural language explanations, enabling step refinement. During post-training, it utilizes Monte Carlo Tree Search guided by Trustworthy Process Rewarding to collect high-quality step-level preference data, optimized through Iterative Preference Optimization. ReARTeR addresses three core challenges: (1) misalignment between PRM and PEM, tackled through off-policy preference learning; (2) bias in PRM training data, mitigated by balanced annotation methods and stronger annotations for challenging examples; and (3) early-step bias in PRM, resolved through a temporal-difference-based look-ahead search strategy. Experimental results on multi-step reasoning benchmarks demonstrate significant improvements, underscoring ReARTeR's potential to advance the reasoning capabilities of RAG systems.
中文: ReARTeR框架通过可信过程奖励机制和后训练优化,有效解决了检索增强生成系统在复杂推理中的偏差和对齐问题,显著提升了多步推理能力。
English: ReARTeR enhances RAG systems' reasoning by integrating trustworthy process rewarding with post-training optimization and test-time scaling, addressing key limitations like bias and misalignment to achieve significant improvements in multi-step reasoning tasks.

Authors:Zheqi Lv, Wenkai Wang, Jiawei Wang, Shengyu Zhang, Fei Wu
Title: Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs
Abstract:
Efficient Multimodal Large Language Models (EMLLMs) can improve performance through Chain-of-Thought (CoT) reasoning, but they have poor self-evaluation capabilities during the CoT reasoning process. This is due to their tendency to simplify the reasoning process and the degradation of self-evaluation ability during downstream task fine-tuning. To address this, we intuitively propose \textit{Self-Evaluation Augmented Training (SEAT)}, which uses more powerful EMLLMs to evaluate CoT reasoning data. The evaluation data is then used to train EMLLMs. However, due to the difficulties EMLLMs face with processing long token input-output sequences, and the degradation of self-evaluation ability as a basis for CoT reasoning, the SEAT method is not fully adapted. Therefore, we further propose \textit{Cascaded Self-Evaluation Augmented Training (Cas-SEAT)}, which converts long prompts into cascaded short prompts, each focusing on a specific task. Additionally, we mix CoT reasoning and self-evaluation data to preserve the CoT reasoning ability of EMLLMs while enhancing their self-evaluation capability. We also conduct \textit{Double-level Data Filtering (DDF)}, which includes source data filtering and labeled data filtering, using both manual selection and MLLMs for filtering. Cas-SEAT and DDF work together to improve the performance of EMLLMs. Experiments show that Cas-SEAT achieves an average improvement of 22.16% across multiple datasets, and DDF significantly reduces the resource consumption of training.
Chinese: 针对高效多模态大语言模型在思维链推理中自我评估能力不足的问题,本研究提出级联自评估增强训练和双重数据过滤方法,不仅将模型性能平均提升22.16%,还显著降低了训练资源消耗。
English: To address the poor self-evaluation capabilities of Efficient Multimodal Large Language Models (EMLLMs) during Chain-of-Thought reasoning, this study introduces Cascaded Self-Evaluation Augmented Training (Cas-SEAT) and Double-level Data Filtering (DDF), which together enhance performance by 22.16% on average while reducing resource consumption.

Authors:Mingzi Wang, Yuan Meng, Chen Tang, Weixiang Zhang, Yijian Qin, Yang Yao, Yingxin Li, Tongtong Feng, Xin Wang, Xun Guan, Zhi Wang, Wenwu Zhu
Title: JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with Hardware-Software Co-Exploration
Abstract:
The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes the three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Memory overhead on the software side: Low-precision quantization-aware training can lead to significant memory usage due to storing large intermediate features and latent weights for back-propagation, potentially causing memory exhaustion. (2) Time-consuming search on the hardware side: The discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
中文:JAQ框架通过协同设计神经网络、量化精度和硬件加速器,有效解决了内存和搜索难题,为边缘设备实现了更高的准确性和更快的优化速度。
English: The JAQ Framework co-designs neural networks, quantization, and hardware to overcome memory and search challenges, achieving higher accuracy and faster optimization for edge devices.

Authors:Zhaoyi Yan, Yiming Zhang, Baoyi He, Yuhao Fu, Qi Zhou, Zhijie Sang, Chunlin Ji, Shengyu Zhang, Fei Wu, Hongxia Yang
Title: InfiFusion: A Unified Framework for Enhanced Cross-Model Reasoning via LLM Fusion
Abstract:
We introduce InfiFusion, an efficient training pipeline designed to integrate multiple domain-specialized Large Language Models (LLMs) into a single pivot model, effectively harnessing the strengths of each source model. Traditional fusion methods either merge model parameters directly or rely on knowledge distillation with rigid assumptions, limiting their flexibility and efficiency. InfiFusion overcomes these limitations by enhancing Universal Logit Distillation (ULD) with Top-K selection and Logits Standardization. We propose two fusion strategies: Pairwise Fusion (InfiFusion$_p$), where each source model's knowledge is distilled individually into the pivot model, followed by merging, and Unified Fusion (InfiFusion$_u$), where knowledge from all source models is distilled simultaneously into the pivot model. InfiFusion outperforms state-of-the-art models, such as Qwen-2.5-14B-Instruct and Phi-4, across 11 widely applied benchmarks covering reasoning, coding, mathematics, and instruction-following tasks. Notably, InfiFusion achieves this superior performance while significantly reducing computational costs, completing full training with only 160 H800 GPU hours compared to the millions typically required for traditional LLM training.
Chinese: InfiFusion是一种高效训练框架,通过增强的通用对数蒸馏技术将多个领域专用大语言模型融合为单一枢纽模型,在11项基准测试中超越先进模型,同时将计算成本大幅降至仅160 GPU小时。
English: InfiFusion is an efficient training pipeline that integrates multiple domain-specialized LLMs into a single pivot model using enhanced Universal Logit Distillation, outperforming state-of-the-art models across 11 benchmarks while drastically reducing computational costs to just 160 GPU hours.
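
As a rough illustration of how Top-K selection and logit standardization could be combined with a ULD-style distillation loss, the sketch below compares sorted teacher and student distributions. It is not the InfiFusion implementation: the z-score form of standardization, the L1 distance over sorted probabilities, and all function names are assumptions made for illustration.

```python
import math


def top_k(logits, k):
    """Keep the k largest logits, sorted in descending order."""
    return sorted(logits, reverse=True)[:k]


def standardize(values):
    """Z-score the values (assumed form of 'logits standardization')."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # avoid division by zero for constant input
    return [(v - mean) / std for v in values]


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def uld_style_loss(teacher_logits, student_logits, k):
    """L1 distance between sorted top-k probabilities (assumed loss form).

    Sorting removes the need for a shared vocabulary index, which is the
    idea behind comparing logits across models with different tokenizers.
    """
    t = softmax(standardize(top_k(teacher_logits, k)))
    s = softmax(standardize(top_k(student_logits, k)))
    return sum(abs(a - b) for a, b in zip(t, s))


# Logits that differ only in scale give (near) zero loss after standardization.
print(uld_style_loss([1.0, 2.0, 3.0, 0.0], [10.0, 20.0, 30.0, -5.0], k=3))
```

Under these assumptions, standardization makes the loss insensitive to the overall scale and offset of each model's logits, so only the shape of the top-k distribution is compared.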

Authors:Yixin Ji, Juntao Li, Yang Xiang, Hai Ye, Kaixin Wu, Kai Yao, Jia Xu, Linjian Mo, Min Zhang
Title: A Survey of Test-Time Compute: From Intuitive Inference to Deliberate Reasoning
Abstract:
The remarkable performance of the o1 model in complex reasoning demonstrates that test-time compute scaling can further unlock the model's potential, enabling powerful System-2 thinking. However, there is still a lack of comprehensive surveys for test-time compute scaling. We trace the concept of test-time compute back to System-1 models. In System-1 models, test-time compute addresses distribution shifts and improves robustness and generalization through parameter updating, input modification, representation editing, and output calibration. In System-2 models, it enhances the model's reasoning ability to solve complex problems through repeated sampling, self-correction, and tree search. We organize this survey according to the trend of System-1 to System-2 thinking, highlighting the key role of test-time compute in the transition from System-1 models to weak System-2 models, and then to strong System-2 models. We also point out advanced topics and future directions.
中文: o1模型在复杂推理中的卓越表现表明,测试时计算扩展能进一步释放模型潜力,实现强大的系统2思维,但目前缺乏相关全面综述,该概念从系统1模型发展到系统2模型,分别通过参数更新和重复采样等方式提升鲁棒性与推理能力。
English: The o1 model's advanced reasoning shows that test-time compute scaling enhances System-2 thinking, yet comprehensive surveys on this topic are lacking, with the concept traced from System-1 to System-2 models to improve robustness and complex problem-solving.

Authors:Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, Linjun Li, Yu Chen, Tao Jin, Zhou Zhao
Title: OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios
Abstract:
With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at https://sharechatx.github.io/.
中文: 本文提出了首个大规模合成口语对话数据集ShareChatX和多轮对话系统OmniChat,通过优化合成与真实数据的配比,在包含音频和音乐等复杂场景的对话任务中取得了最优性能。
English: This paper introduces ShareChatX, a large-scale synthetic spoken dialogue dataset, and OmniChat, a multi-turn dialogue system that achieves state-of-the-art performance by optimally integrating synthetic and real data to handle complex scenarios like audio events and emotional expressions.

Authors:Xuemiao Zhang, Liangyu Xu, Feiyu Duan, Yongwei Zhou, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai
Title: Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data
Abstract:
Large language models (LLMs) generally utilize a consistent data distribution throughout the pretraining process. However, as the model's capability improves, it is intuitive that its data preferences dynamically change, indicating the need for pretraining with different data at various training stages. To achieve this, we propose the Perplexity Difference (PD) based Preference Curriculum learning (PDPC) framework, which always perceives and uses the data preferred by LLMs to train and boost them. First, we introduce the PD metric to quantify the difference in how challenging a sample is for weak versus strong models. Samples with high PD are more challenging for weak models to learn and are more suitable to be arranged in the later stage of pretraining. Second, we propose the preference function to approximate and predict the data preference of the LLM at any training step, so as to complete the arrangement of the dataset offline and ensure continuous training without interruption. Experimental results on 1.3B and 3B models demonstrate that PDPC significantly surpasses baselines. Notably, the 3B model trained on 1T tokens achieves an increased average accuracy of over 8.1% across MMLU and CMMLU.
Chinese Summary: 基于困惑度差异的偏好课程学习框架(PDPC)通过动态选择大型语言模型在不同训练阶段偏好的数据,显著提升了模型性能,在多项基准测试中准确率提高超过8.1%。
English Summary: The Perplexity Difference-based Preference Curriculum (PDPC) framework dynamically selects training data preferred by large language models at different stages, significantly boosting performance as shown by over 8.1% accuracy gains on benchmarks.
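
A minimal sketch of a perplexity-difference score and the resulting curriculum order is given below. The abstract does not state the exact formula, so the normalization by the weak model's perplexity, the per-token log-probability inputs, and the names `perplexity_difference` and `order_by_pd` are assumptions for illustration only.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_difference(weak_logprobs, strong_logprobs):
    """Relative perplexity gap between a weak and a strong model.

    Dividing by the weak model's perplexity is an assumed normalization.
    """
    ppl_weak = perplexity(weak_logprobs)
    ppl_strong = perplexity(strong_logprobs)
    return (ppl_weak - ppl_strong) / ppl_weak


def order_by_pd(samples):
    """Arrange samples so that high-PD samples come later in training."""
    return sorted(
        samples,
        key=lambda s: perplexity_difference(s["weak"], s["strong"]),
    )


# Hypothetical log-probabilities for two samples.
samples = [
    {"id": "hard_for_weak", "weak": [-2.0, -2.0], "strong": [-1.0, -1.0]},
    {"id": "equally_easy", "weak": [-1.0, -1.0], "strong": [-1.0, -1.0]},
]
print([s["id"] for s in order_by_pd(samples)])
```

Here the sample that both models find equally easy has a PD of zero and is scheduled first, while the sample that is much harder for the weak model is scheduled later, matching the ordering described in the abstract.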

Authors:Xiaoli Yan, Nathaniel Hudson, Hyun Park, Daniel Grzenda, J. Gregory Pauloski, Marcus Schwarting, Haochen Pan, Hassan Harb, Samuel Foreman, Chris Knight, Tom Gibbs, Kyle Chard, Santanu Chaudhuri, Emad Tajkhorshid, Ian Foster, Mohamad Moosavi, Logan Ward, E. A. Huerta
Title: MOFA: Discovering Materials for Carbon Capture with a GenAI- and Simulation-Based Workflow
Abstract:
We present MOFA, an open-source generative AI (GenAI) plus simulation workflow for high-throughput generation of metal-organic frameworks (MOFs) on large-scale high-performance computing (HPC) systems. MOFA addresses key challenges in integrating GPU-accelerated computing for GPU-intensive GenAI tasks, including distributed training and inference, alongside CPU- and GPU-optimized tasks for screening and filtering AI-generated MOFs using molecular dynamics, density functional theory, and Monte Carlo simulations. These heterogeneous tasks are unified within an online learning framework that optimizes the utilization of available CPU and GPU resources across HPC systems. Performance metrics from a 450-node (14,400 AMD Zen 3 CPUs + 1800 NVIDIA A100 GPUs) supercomputer run demonstrate that MOFA achieves high-throughput generation of novel MOF structures, with CO$_2$ adsorption capacities ranking among the top 10 in the hypothetical MOF (hMOF) dataset. Furthermore, the production of high-quality MOFs exhibits a linear relationship with the number of nodes utilized. The modular architecture of MOFA will facilitate its integration into other scientific applications that dynamically combine GenAI with large-scale simulations.
中文: MOFA是一个开源生成式AI与模拟工作流,能在大规模高性能计算系统上高效生成新型金属有机框架材料,其二氧化碳吸附性能位列前茅,且产出质量与计算节点数量呈线性增长关系。
English: MOFA is an open-source generative AI and simulation workflow that efficiently generates novel metal-organic frameworks on large-scale HPC systems, achieving top-tier CO₂ adsorption performance while demonstrating linear scalability with computing resources.

Authors:Alok Kamatar, Maxime Gonthier, Valerie Hayot-Sasson, Andre Bauer, Marcin Copik, Torsten Hoefler, Raul Castro Fernandez, Kyle Chard, Ian Foster
Title: Core Hours and Carbon Credits: Incentivizing Sustainability in HPC
Abstract:
Realizing a shared responsibility between providers and consumers is critical to manage the sustainability of HPC. However, while cost may motivate efficiency improvements by infrastructure operators, broader progress is impeded by a lack of user incentives. We conduct a survey of HPC users that reveals fewer than 30 percent are aware of their energy consumption, and that energy efficiency is among users' lowest priority concerns. One explanation is that existing pricing models may encourage users to prioritize performance over energy efficiency. We propose two transparent multi-resource pricing schemes, Energy- and Carbon-Based Accounting, that seek to change this paradigm by incentivizing more efficient user behavior. These two schemes charge for computations based on their energy consumption or carbon footprint, respectively, rewarding users who leverage efficient hardware and software. We evaluate these two pricing schemes via simulation, in a prototype, and a user study.
中文摘要:实现高性能计算的可持续性需供应商与用户共担责任,但用户激励不足阻碍了进展,因此提出基于能耗和碳足迹的透明定价方案,通过奖励高效软硬件使用来引导用户行为。
English Summary: Achieving sustainability in high-performance computing requires shared provider-user responsibility, which is hindered by the lack of user incentives, leading to proposed energy- and carbon-based pricing schemes to encourage efficient behavior through transparent accounting.
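
The difference between charging by energy and charging by carbon footprint can be sketched as follows. The abstract gives no formulas, so the `Job` fields, the prices, and the linear charging rules are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Job:
    node_hours: float        # wall-clock hours multiplied by nodes used
    avg_power_kw: float      # assumed average power draw per node, in kW
    carbon_intensity: float  # assumed grid intensity, kg CO2e per kWh


def energy_kwh(job):
    """Energy consumed by the job in kWh."""
    return job.node_hours * job.avg_power_kw


def energy_charge(job, price_per_kwh):
    """Energy-based accounting: charge in proportion to energy used."""
    return energy_kwh(job) * price_per_kwh


def carbon_charge(job, price_per_kg):
    """Carbon-based accounting: charge in proportion to emissions."""
    return energy_kwh(job) * job.carbon_intensity * price_per_kg


# Two hypothetical jobs with equal node-hours but different efficiency.
efficient = Job(node_hours=10, avg_power_kw=0.3, carbon_intensity=0.4)
wasteful = Job(node_hours=10, avg_power_kw=0.5, carbon_intensity=0.4)
print(energy_charge(efficient, 0.2), energy_charge(wasteful, 0.2))
```

Under plain node-hour pricing both jobs would cost the same. Under either scheme sketched here the more efficient job is charged less, which is the incentive the abstract describes.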

Authors:Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
Title: Initial Findings on Sensor based Open Vocabulary Activity Recognition via Text Embedding Inversion
Abstract:
Conventional human activity recognition (HAR) relies on classifiers trained to predict discrete activity classes, inherently limiting recognition to activities explicitly present in the training set. Such classifiers would invariably fail, putting zero likelihood, when encountering unseen activities. We propose Open Vocabulary HAR (OV-HAR), a framework that overcomes this limitation by first converting each activity into natural language and breaking it into a sequence of elementary motions. This descriptive text is then encoded into a fixed-size embedding. The model is trained to regress this embedding, which is subsequently decoded back into natural language using a pre-trained embedding inversion model. Unlike other works that rely on auto-regressive large language models (LLMs) at their core, OV-HAR achieves open vocabulary recognition without the computational overhead of such models. The generated text can be transformed into a single activity class using LLM prompt engineering. We have evaluated our approach on different modalities, including vision (pose), IMU, and pressure sensors, demonstrating robust generalization across unseen activities and modalities, offering a fundamentally different paradigm from contemporary classifiers.
中文: OV-HAR通过将活动转化为自然语言嵌入并解码,实现了开放词汇的人类活动识别,无需依赖计算密集型大语言模型,并在未见过的活动和传感器模态上展现出强大的泛化能力。
English: OV-HAR enables open vocabulary human activity recognition by converting activities into natural language embeddings and decoding them without relying on computationally intensive large language models, demonstrating strong generalization across unseen activities and sensor modalities.

Authors:Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
Title: MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Abstract:
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and the full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
中文: 本研究提出的MinMo是一个80亿参数的多模态大语言模型,通过多阶段训练在大量语音数据上实现了语音理解与生成的最优性能,同时支持全双工对话并具备增强的语音控制功能。
English: This work introduces MinMo, an 8-billion-parameter multimodal large language model that achieves state-of-the-art performance in voice comprehension and generation through multi-stage training on extensive speech data, while enabling full-duplex conversations and enhanced voice control capabilities.

Authors:Steven Au, Cameron J. Dimacali, Ojasmitha Pedirappagari, Namyong Park, Franck Dernoncourt, Yu Wang, Nikos Kanakaris, Hanieh Deilamsalehy, Ryan A. Rossi, Nesreen K. Ahmed
Title: Personalized Graph-Based Retrieval for Large Language Models
Abstract:
As large language models (LLMs) evolve, their ability to deliver personalized and context-aware responses offers transformative potential for improving user experiences. Existing personalization approaches, however, often rely solely on user history to augment the prompt, limiting their effectiveness in generating tailored outputs, especially in cold-start scenarios with sparse data. To address these limitations, we propose Personalized Graph-based Retrieval-Augmented Generation (PGraphRAG), a framework that leverages user-centric knowledge graphs to enrich personalization. By directly integrating structured user knowledge into the retrieval process and augmenting prompts with user-relevant context, PGraphRAG enhances contextual understanding and output quality. We also introduce the Personalized Graph-based Benchmark for Text Generation, designed to evaluate personalized text generation tasks in real-world settings where user history is sparse or unavailable. Experimental results show that PGraphRAG significantly outperforms state-of-the-art personalization methods across diverse tasks, demonstrating the unique advantages of graph-based retrieval for personalization.
中文: 提出的PGraphRAG框架通过将用户知识图谱融入检索过程,显著提升了文本生成的个性化效果,即使在数据稀疏的冷启动场景下也优于现有方法。
English: The proposed PGraphRAG framework enhances personalized text generation by integrating user-centric knowledge graphs into retrieval processes, significantly outperforming existing methods even in cold-start scenarios with sparse data.

Authors:Dayuan Fu, Keqing He, Yejie Wang, Wentao Hong, Zhuoma Gongque, Weihao Zeng, Wei Wang, Jingang Wang, Xunliang Cai, Weiran Xu
Title: AgentRefine: Enhancing Agent Generalization through Refinement Tuning
Abstract:
Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning. We first observe that the existing agent training corpus yields satisfactory results on held-in evaluation sets but fails to generalize to held-out sets. Models from these agent-tuning works make severe formatting errors and are frequently stuck in the same mistake for a long while. Our analysis suggests that the poor generalization ability comes from overfitting to several manual agent environments and a lack of adaptation to new situations. The models struggle with wrong action steps and cannot learn from experience; they merely memorize existing observation-action relations. Inspired by this insight, we propose a novel AgentRefine framework for agent-tuning. The core idea is to enable the model to learn to correct its mistakes via observation in the trajectory. Specifically, we propose an agent synthesis framework to encompass a diverse array of environments and tasks and prompt a strong LLM to refine its erroneous actions according to the environment feedback. AgentRefine significantly outperforms state-of-the-art agent-tuning work in terms of generalization ability on diverse agent tasks. It is also more robust to perturbation and can generate diversified thoughts during inference. Our findings establish the correlation between agent generalization and self-refinement and provide a new paradigm for future research.
中文: AgentRefine框架通过让模型从环境反馈中学习纠正错误,显著提升了LLM智能体的泛化能力,在多样任务中优于现有方法且具备更强的鲁棒性。
English: The AgentRefine framework enhances LLM agents' generalization by enabling them to learn from mistakes through environmental feedback, outperforming existing methods in adaptability and robustness.

Authors:Yanwen Huang, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang, Jing Xiao
Title: Dynamic Attention-Guided Context Decoding for Mitigating Context Faithfulness Hallucinations in Large Language Models
Abstract:
Large language models (LLMs) often exhibit Context Faithfulness Hallucinations, where outputs deviate from retrieved information due to incomplete context integration. Our analysis reveals a strong correlation between token-level uncertainty and hallucinations. We hypothesize that attention mechanisms inherently encode context utilization signals, supported by probing analysis. Based on these insights, we propose Dynamic Attention-Guided Context Decoding (DAGCD), a lightweight framework that leverages attention distributions and uncertainty signals in a single-pass decoding. Experiments on open-book QA datasets demonstrate DAGCD's effectiveness, yielding significant improvements in faithfulness and robustness while preserving computational efficiency.
Chinese Summary: 本研究提出动态注意力引导上下文解码(DAGCD)框架,通过利用注意力分布和不确定性信号来减少大语言模型中的上下文忠实性幻觉,在开放问答数据集上验证了其有效性。
English Summary: The study introduces Dynamic Attention-Guided Context Decoding (DAGCD), a lightweight framework that uses attention distributions and uncertainty signals to reduce context faithfulness hallucinations in large language models, showing improved performance on QA datasets.

Authors:Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
Title: LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
Abstract:
Recent advancements in large language models (LLMs) based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances the multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.
中文: LUSIFER提出了一种零样本方法,通过最小可训练连接器将多语言编码器与专用嵌入模型结合,使基于大语言模型的嵌入模型适应多语言任务,无需多语言训练数据即可显著提升14种语言的性能。
English: LUSIFER introduces a zero-shot method to adapt LLM-based embedding models for multilingual tasks by integrating a multilingual encoder with specialized embedding models through minimal trainable connectors, significantly improving performance across 14 languages without multilingual training data.

Authors:Lala Shakti Swarup Ray, Bo Zhou, Sungho Suh, Paul Lukowicz
Title: OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models
Abstract:
Understanding human-to-human interactions, especially in contexts like public security surveillance, is critical for monitoring and maintaining safety. Traditional activity recognition systems are limited by fixed vocabularies, predefined labels, and rigid interaction categories that often rely on choreographed videos and overlook concurrent interactive groups. These limitations make such systems less adaptable to real-world scenarios, where interactions are diverse and unpredictable. In this paper, we propose an open vocabulary human-to-human interaction recognition (OV-HHIR) framework that leverages large language models to generate open-ended textual descriptions of both seen and unseen human interactions in open-world settings without being confined to a fixed vocabulary. Additionally, we create a comprehensive, large-scale human-to-human interaction dataset by standardizing and combining existing public human interaction datasets into a unified benchmark. Extensive experiments demonstrate that our method outperforms traditional fixed-vocabulary classification systems and existing cross-modal language models for video understanding, setting the stage for more intelligent and adaptable visual understanding systems in surveillance and beyond.
中文: 本文提出了一种开放词汇的人与人交互识别框架,利用大语言模型突破传统系统固定分类的限制,通过统一基准数据集验证,在性能上超越了现有方法。
English: This paper introduces an open vocabulary human-to-human interaction recognition framework that overcomes the limitations of traditional systems by leveraging large language models to describe diverse interactions without fixed categories, validated through a unified benchmark dataset and outperforming existing methods.

Authors:Kainan Liu, Yong Zhang, Ning Cheng, Zhitao Li, Shaojun Wang, Jing Xiao
Title: GRASP: Replace Redundant Layers with Adaptive Singular Parameters for Efficient Model Compression
Abstract:
Recent studies have demonstrated that many layers are functionally redundant in large language models (LLMs), enabling model compression by removing these layers to reduce inference cost. While such approaches can improve efficiency, indiscriminate layer pruning often results in significant performance degradation. In this paper, we propose GRASP (Gradient-based Retention of Adaptive Singular Parameters), a novel compression framework that mitigates this issue by preserving sensitivity-aware singular values. Unlike direct layer pruning, GRASP leverages gradient-based attribution on a small calibration dataset to adaptively identify and retain critical singular components. By replacing redundant layers with only a minimal set of parameters, GRASP achieves efficient compression while maintaining strong performance with minimal overhead. Experiments across multiple LLMs show that GRASP consistently outperforms existing compression methods, achieving 90% of the original model's performance under a 20% compression ratio.
中文: GRASP是一种基于梯度的新型压缩框架,通过保留关键奇异参数,在最小开销下高效缩减大语言模型规模的同时保持优异性能。
English: GRASP is a novel gradient-based compression framework that preserves critical singular parameters to efficiently reduce LLM size while maintaining strong performance with minimal overhead.
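
The idea of scoring singular components with gradients and keeping only the most important ones can be sketched as follows. This is a minimal illustration, not the authors' implementation: the first-order importance score `|sigma_k * dL/dsigma_k|`, the function name `select_singular_components`, and the random stand-in for a calibration-set gradient are all assumptions made here for illustration.

```python
import numpy as np

def select_singular_components(weight, grad, keep):
    """Keep the `keep` singular components with the largest first-order importance.

    Assumption: importance of component k is |sigma_k * dL/dsigma_k|, a Taylor-style
    attribution. The paper's exact scoring rule may differ.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # For W = sum_k sigma_k u_k v_k^T, the chain rule gives dL/dsigma_k = u_k^T (dL/dW) v_k.
    sigma_grad = np.einsum("ik,ij,kj->k", u, grad, vt)
    importance = np.abs(s * sigma_grad)
    top = np.sort(np.argsort(importance)[-keep:])  # indices of retained components
    return u[:, top], s[top], vt[top, :]

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4))
grad = rng.normal(size=(6, 4))  # hypothetical stand-in for a gradient from calibration data
u_k, s_k, vt_k = select_singular_components(weight, grad, keep=2)
low_rank = (u_k * s_k) @ vt_k   # compact replacement built from the retained components
```

Storing `u_k`, `s_k`, and `vt_k` instead of the full matrix is what yields the parameter savings; selecting by gradient-based importance rather than by singular-value magnitude alone is the design choice the abstract emphasizes.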

Authors:Ling Fu, Zhebin Kuang, Jiajun Song, Mingxin Huang, Biao Yang, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai
Title: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Abstract:
Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the OCRBench v2's reliability. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/
中文: OCRBench v2 被提出作为一个全面的双语基准,用于评估大型多模态模型的OCR能力,结果显示大多数模型因在文本定位和逻辑推理等挑战性任务上的局限而得分低于50%。
English: OCRBench v2 is introduced as a comprehensive bilingual benchmark to evaluate Large Multimodal Models' OCR capabilities, revealing that most models score below 50% due to limitations in challenging tasks like text localization and logical reasoning.

Authors:Lo Pang-Yun Ting, Zhen Tan, Hong-Pei Chen, Cheng-Te Li, Po-Lin Chen, Kun-Ta Chuang, Huan Liu
Title: CAND: Cross-Domain Ambiguity Inference for Early Detecting Nuanced Illness Deterioration
Abstract:
Early detection of patient deterioration is essential for timely treatment, with vital signs like heart rates being key health indicators. Existing methods tend to solely analyze vital sign waveforms, ignoring transition relationships of waveforms within each vital sign and the correlation strengths among various vital signs. Such studies often overlook nuanced illness deterioration, which is the early sign of worsening health but is difficult to detect. In this paper, we introduce CAND, a novel method that organizes the transition relationships and the correlations within and among vital signs as domain-specific and cross-domain knowledge. CAND jointly models this knowledge in a unified representation space, considerably enhancing the early detection of nuanced illness deterioration. In addition, CAND integrates a Bayesian inference method that utilizes augmented knowledge from domain-specific and cross-domain knowledge to address the ambiguities in correlation strengths. With this architecture, the correlation strengths can be effectively inferred to guide joint modeling and enhance representations of vital signs. This allows a more holistic and accurate interpretation of patient health. Our experiments on a real-world ICU dataset demonstrate that CAND significantly outperforms existing methods in both effectiveness and earliness in detecting nuanced illness deterioration. Moreover, we conduct a case study for the interpretable detection process to showcase the practicality of CAND.
中文: CAND方法通过将生命体征内的转换关系和体征间的相关性强度统一建模,显著提升了细微病情恶化的早期检测能力,在效果和及时性上均优于现有方法。
English: The CAND method enhances early detection of nuanced patient deterioration by modeling transition relationships within vital signs and correlation strengths among them in a unified space, outperforming existing approaches in effectiveness and timeliness.

Authors:Xiaoyang Wang, Hongming Zhang, Tao Ge, Wenhao Yu, Dian Yu, Dong Yu
Title: OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas
Abstract:
Customizable role-playing in large language models (LLMs), also known as character generalization, is gaining increasing attention for its versatility and cost-efficiency in developing and deploying role-playing dialogue agents. This study explores a large-scale data synthesis approach to equip LLMs with character generalization capabilities. We begin by synthesizing large-scale character profiles using personas from Persona Hub and then explore two strategies: response rewriting and response generation, to create character-aligned instructional responses. To validate the effectiveness of our synthetic instruction tuning data for character generalization, we perform supervised fine-tuning (SFT) using the LLaMA-3 8B model. Our best-performing model strengthens the original LLaMA-3 8B Instruct model and achieves performance comparable to GPT-4o models on role-playing dialogue. We release our synthetic characters and instruction-tuning dialogues to support public research.
中文: 本研究利用Persona Hub角色库开发可扩展的数据合成方法,通过监督微调增强大语言模型的角色泛化能力,在角色扮演对话中达到GPT-4o级别性能,并公开了合成数据以支持研究。
English: This study develops a scalable data synthesis method using personas from Persona Hub to enhance large language models' character generalization through supervised fine-tuning, achieving GPT-4o-level performance in role-playing dialogues with publicly released synthetic data.

Authors:Bo Yang, Jiaxian Guo, Yusuke Iwasawa, Yutaka Matsuo
Title: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection
Abstract:
Recent studies have increasingly demonstrated that large language models (LLMs) possess significant theory of mind (ToM) capabilities, showing the potential for simulating the tracking of mental states in generative agents. In this study, we propose a novel paradigm called ToM-agent, designed to empower LLM-based generative agents to simulate ToM in open-domain conversational interactions. ToM-agent disentangles the confidence from mental states, facilitating the emulation of an agent's perception of its counterpart's mental states, such as beliefs, desires, and intentions (BDIs). Using past conversation history and verbal reflections, ToM-agent can dynamically adjust counterparts' inferred BDIs, along with related confidence levels. We further put forth a counterfactual intervention method that reflects on the gap between the predicted responses of counterparts and their real utterances, thereby enhancing the efficiency of reflection. Leveraging empathetic and persuasion dialogue datasets, we assess the advantages of implementing the ToM-agent with downstream tasks, as well as its performance in both the first-order and the second-order ToM. Our findings indicate that the ToM-agent can grasp the underlying reasons for their counterpart's behaviors beyond mere semantic-emotional supporting or decision-making based on common sense, providing new insights for studying large-scale LLM-based simulation of human social behaviors.
中文摘要:最新研究提出ToM-agent新范式,通过动态追踪对话中的心理状态与置信度来增强大语言模型的心理理论能力,实验表明该方法能够超越传统语义或常识推理,更深入理解人类行为动因。
English Summary: Recent research introduces ToM-agent, a novel paradigm that enhances large language models' theory of mind capabilities by dynamically tracking mental states and confidence levels in conversations, showing improved understanding of human behavior beyond conventional semantic or commonsense reasoning.

Authors:Yantuan Xian, Pu Li, Hao Peng, Zhengtao Yu, Yan Xiang, Philip S. Yu
Title: Community Detection in Large-Scale Complex Networks via Structural Entropy Game
Abstract:
Community detection is a critical task in graph theory, social network analysis, and bioinformatics, where communities are defined as clusters of densely interconnected nodes. However, detecting communities in large-scale networks with millions of nodes and billions of edges remains challenging due to the inefficiency and unreliability of existing methods. Moreover, many current approaches are limited to specific graph types, such as unweighted or undirected graphs, reducing their broader applicability. To address these issues, we propose a novel heuristic community detection algorithm, termed CoDeSEG, which identifies communities by minimizing the two-dimensional (2D) structural entropy of the network within a potential game framework. In the game, nodes decide to stay in their current community or move to another based on a strategy that maximizes the 2D structural entropy utility function. Additionally, we introduce a structural entropy-based node overlapping heuristic for detecting overlapping communities, with a near-linear time complexity. Experimental results on real-world networks demonstrate that CoDeSEG is the fastest method available and achieves state-of-the-art performance in overlapping normalized mutual information (ONMI) and F1 score.
中文: 提出的CoDeSEG算法通过潜在博弈框架最小化二维结构熵来有效检测大规模网络中的社区,在接近线性的时间复杂度下实现了最优性能。
English: The proposed CoDeSEG algorithm efficiently detects communities in large-scale networks by minimizing 2D structural entropy within a game-theoretic framework, achieving state-of-the-art performance with near-linear time complexity.
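
To make the objective concrete, the sketch below computes the 2D structural entropy of a partition for an unweighted, undirected graph, using the standard definition from structural information theory. It is assumed here that CoDeSEG optimizes a quantity of this form; the function name `structural_entropy_2d` and the toy graph are illustrative only, and the game dynamics and overlapping heuristic are not reproduced.

```python
import math
from collections import defaultdict

def structural_entropy_2d(edges, community):
    """2D structural entropy of a partition (unweighted, undirected graph).

    H = - sum_nodes (d_i / 2m) * log2(d_i / V_c)
        - sum_communities (g_c / 2m) * log2(V_c / 2m)
    where d_i is the node degree, V_c the community volume (sum of degrees),
    g_c the number of edges leaving the community, and 2m the total volume.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    volume = defaultdict(int)
    for node, d in degree.items():
        volume[community[node]] += d
    cut = defaultdict(int)
    for u, v in edges:
        if community[u] != community[v]:
            cut[community[u]] += 1
            cut[community[v]] += 1
    total = sum(degree.values())
    h = 0.0
    for node, d in degree.items():
        h -= d / total * math.log2(d / volume[community[node]])
    for c, vol in volume.items():
        h -= cut[c] / total * math.log2(vol / total)
    return h

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
good = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}   # one community per triangle
mixed = {0: "a", 3: "a", 1: "b", 4: "b", 2: "c", 5: "c"}  # communities cut across triangles
```

On this toy graph, the partition that follows the two triangles has lower entropy than the one that cuts across them, which is the signal a node would use when deciding whether to move.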

Authors:Kangjie Zheng, Junwei Yang, Siyue Liang, Bin Feng, Zequn Liu, Wei Ju, Zhiping Xiao, Ming Zhang
Title: ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models
Abstract:
Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of [MASK] tokens on MLMs. Analytical studies show that masking tokens can introduce the corrupted semantics problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands [MASK] tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement, and effectively reduces the semantic multimodality commonly observed in MLMs.
中文: 本文提出ExLM增强上下文掩码语言模型,通过扩展[MASK]标记并建模扩展状态间的依赖关系,有效缓解语义损坏问题,在文本和SMILES建模任务中取得显著性能提升。
English: This paper introduces ExLM, an enhanced-context masked language model that expands [MASK] tokens to model dependencies between expanded states, effectively mitigating the corrupted semantics problem and achieving significant performance improvements in text and SMILES modeling tasks.

Authors:Haolin Jin, Huaming Chen, Qinghua Lu, Liming Zhu
Title: Towards Advancing Code Generation with Large Language Models: A Research Roadmap
Abstract:
Recently, we have witnessed the rapid development of large language models, which have demonstrated excellent capabilities in the downstream task of code generation. However, despite their potential, LLM-based code generation still faces numerous technical and evaluation challenges, particularly when embedded in real-world development. In this paper, we present our vision for current research directions, and provide an in-depth analysis of existing studies on this task. We propose a six-layer vision framework that categorizes the code generation process into distinct phases, namely Input Phase, Orchestration Phase, Development Phase, and Validation Phase. Additionally, we outline our vision workflow, which reflects on the currently prevalent frameworks. We systematically analyse the challenges faced by large language models, including LLM-based agent frameworks, in code generation tasks. With these, we offer various perspectives and actionable recommendations in this area. Our aim is to provide guidelines for improving the reliability, robustness and usability of LLM-based code generation systems. Ultimately, this work seeks to address persistent challenges and to provide practical suggestions for a more pragmatic LLM-based solution for future code generation endeavors.
中文: 本文针对基于大语言模型的代码生成所面临的技术与评估挑战,提出了一个分层框架和深入分析,旨在通过具体建议提升其在实际开发中的可靠性和实用性。
English: This paper presents a framework and analysis addressing the technical and evaluation challenges of LLM-based code generation, offering recommendations to enhance its reliability and practicality in real-world development.

Authors:Xingyi He, Hao Yu, Sida Peng, Dongli Tan, Zehong Shen, Hujun Bao, Xiaowei Zhou
Title: MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
Abstract:
Image matching, which aims to identify corresponding pixel locations between images, is crucial in a wide range of scientific disciplines, aiding in image registration, fusion, and analysis. In recent years, deep learning-based image matching algorithms have dramatically outperformed humans in rapidly and accurately finding large amounts of correspondences. However, when dealing with images captured under different imaging modalities that result in significant appearance changes, the performance of these algorithms often deteriorates due to the scarcity of annotated cross-modal training data. This limitation hinders applications in various fields that rely on multiple image modalities to obtain complementary information. To address this challenge, we propose a large-scale pre-training framework that utilizes synthetic cross-modal training signals, incorporating diverse data from various sources, to train models to recognize and match fundamental structures across images. This capability is transferable to real-world, unseen cross-modality image matching tasks. Our key finding is that the matching model trained with our framework achieves remarkable generalizability across more than eight unseen cross-modality registration tasks using the same network weight, substantially outperforming existing methods, whether designed for generalization or tailored for specific tasks. This advancement significantly enhances the applicability of image matching technologies across various scientific disciplines and paves the way for new applications in multi-modality human and artificial intelligence analysis and beyond.
中文: 基于深度学习的图像匹配在速度和精度上表现优异,但面对跨模态图像时因训练数据不足而受限;我们提出的预训练框架利用合成数据解决了这一问题,显著提升了模型在多种任务中的泛化能力。
English: Deep learning-based image matching excels in speed and accuracy but struggles with cross-modal images due to limited training data, a challenge addressed by our pre-training framework using synthetic data to boost generalizability across diverse tasks.

Authors:Gouki Minegishi, Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo
Title: Rethinking Evaluation of Sparse Autoencoders through the Representation of Polysemous Words
Abstract:
Sparse autoencoders (SAEs) have gained a lot of attention as a promising tool to improve the interpretability of large language models (LLMs) by mapping the complex superposition of polysemantic neurons into monosemantic features and composing a sparse dictionary of words. However, traditional performance metrics like Mean Squared Error and L0 sparsity ignore the evaluation of the semantic representational power of SAEs -- whether they can acquire interpretable monosemantic features while preserving the semantic relationship of words. For instance, it is not obvious whether a learned sparse feature could distinguish different meanings in one word. In this paper, we propose a suite of evaluations for SAEs to analyze the quality of monosemantic features by focusing on polysemous words. Our findings reveal that SAEs developed to improve the MSE-L0 Pareto frontier may confuse interpretability, which does not necessarily enhance the extraction of monosemantic features. The analysis of SAEs with polysemous words can also figure out the internal mechanism of LLMs; deeper layers and the Attention module contribute to distinguishing polysemy in a word. Our semantics focused evaluation offers new insights into the polysemy and the existing SAE objective and contributes to the development of more practical SAEs.
中文摘要:本文提出了一种针对稀疏自编码器(SAE)的新评估方法,专注于分析多义词的单义特征提取能力,发现传统指标优化未必提升可解释性,并揭示了深层网络和注意力机制在区分多义词义中的作用。
English Summary: This paper introduces a new evaluation framework for sparse autoencoders (SAEs) that assesses their ability to extract monosemantic features from polysemous words, revealing that optimizing traditional metrics may not improve interpretability and highlighting the role of deeper layers and attention in resolving polysemy.
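
For reference, the two traditional metrics the abstract contrasts with its semantic evaluation can be sketched as below. This is a generic illustration on plain Python lists, not code from the paper; the function names and the example vectors are assumptions.

```python
def mse(original, reconstruction):
    """Mean squared error between an activation vector and its SAE reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)

def l0(latents, eps=0.0):
    """L0 sparsity: number of latent features whose magnitude exceeds eps."""
    return sum(1 for z in latents if abs(z) > eps)

# Hypothetical values for a single token position.
activation = [1.0, 2.0]
reconstruction = [1.0, 4.0]
latents = [0.0, 0.5, 0.0, 1.2]

reconstruction_error = mse(activation, reconstruction)
active_features = l0(latents)
```

Neither number says which meaning of a polysemous word a feature fires on, which is the gap the proposed evaluation targets.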

Authors:Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Title: PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Abstract:
Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust bench for advancing research on PRM evaluation and development.
Chinese: PRMBench是一个专门设计的过程级基准测试,用于系统评估过程级奖励模型在检测推理步骤中细微错误的能力,揭示了现有模型的显著不足并指明了未来研究方向。
English: PRMBench is a new benchmark designed to systematically evaluate Process-level Reward Models (PRMs) by testing their ability to detect fine-grained errors in reasoning steps, revealing significant weaknesses in current models and highlighting future research directions.

Authors:Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, Boris Ivanovic, Yue Wang, Marco Pavone
Title: STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes
Abstract:
We present STORM, a spatio-temporal reconstruction model designed for reconstructing dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods often rely on per-scene optimization, dense observations across space and time, and strong motion supervision, resulting in lengthy optimization times, limited generalization to novel views or scenes, and degenerated quality caused by noisy pseudo-labels for dynamics. To address these challenges, STORM leverages a data-driven Transformer architecture that directly infers dynamic 3D scene representations--parameterized by 3D Gaussians and their velocities--in a single forward pass. Our key design is to aggregate 3D Gaussians from all frames using self-supervised scene flows, transforming them to the target timestep to enable complete (i.e., "amodal") reconstructions from arbitrary viewpoints at any moment in time. As an emergent property, STORM automatically captures dynamic instances and generates high-quality masks using only reconstruction losses. Extensive experiments on public datasets show that STORM achieves precise dynamic scene reconstruction, surpassing state-of-the-art per-scene optimization methods (+4.3 to 6.6 PSNR) and existing feed-forward approaches (+2.1 to 4.7 PSNR) in dynamic regions. STORM reconstructs large-scale outdoor scenes in 200ms, supports real-time rendering, and outperforms competitors in scene flow estimation, improving 3D EPE by 0.422m and Acc5 by 28.02%. Beyond reconstruction, we showcase four additional applications of our model, illustrating the potential of self-supervised learning for broader dynamic scene understanding.
中文: STORM是一种创新的时空重建模型,通过Transformer架构从稀疏观测中单次前向推断动态户外场景,在重建质量和效率上均优于现有方法。
English: STORM is a novel spatio-temporal reconstruction model that uses a Transformer architecture to reconstruct dynamic outdoor scenes from sparse observations in a single forward pass, achieving superior performance over existing methods in both reconstruction quality and efficiency.
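
The aggregation step described in the abstract, moving per-frame 3D Gaussians to a common target timestep using their velocities, can be sketched as follows. The constant-velocity (linear) motion model, the `Gaussian` dataclass, and the function names are assumptions made for illustration; the paper's actual parameterization may differ, and covariance, opacity, and color are omitted.

```python
from dataclasses import dataclass

@dataclass
class Gaussian:
    center: tuple     # (x, y, z) position, in meters
    velocity: tuple   # (vx, vy, vz), in meters per second
    timestamp: float  # time at which the Gaussian was predicted, in seconds

def warp_to(gaussian, t_target):
    """Move a Gaussian to t_target, assuming constant velocity over the interval."""
    dt = t_target - gaussian.timestamp
    center = tuple(c + v * dt for c, v in zip(gaussian.center, gaussian.velocity))
    return Gaussian(center, gaussian.velocity, t_target)

def aggregate(frames, t_target):
    """Collect Gaussians from every frame after warping them to the same timestep."""
    return [warp_to(g, t_target) for frame in frames for g in frame]

# Two hypothetical frames: a moving object seen at t=0 and a static one seen at t=1.
frames = [
    [Gaussian((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.0)],
    [Gaussian((5.0, 0.0, 0.0), (0.0, 0.0, 0.0), 1.0)],
]
scene_at_2s = aggregate(frames, t_target=2.0)
```

Because every Gaussian ends up at the same timestep, the merged set can be rendered from any viewpoint at that moment, which is how observations from different frames can fill in regions a single frame does not see.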

Authors:Jiageng Mao, Boyi Li, Boris Ivanovic, Yuxiao Chen, Yan Wang, Yurong You, Chaowei Xiao, Danfei Xu, Marco Pavone, Yue Wang
Title: DreamDrive: Generative 4D Scene Modeling from Street View Images
Abstract:
Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with maintaining 3D visual consistency. In this paper, we present DreamDrive, a 4D spatial-temporal scene generation approach that combines the merits of generation and reconstruction, to synthesize generalizable 4D driving scenes and dynamic driving videos with 3D consistency. Specifically, we leverage the generative power of video diffusion models to synthesize a sequence of visual references and further elevate them to 4D with a novel hybrid Gaussian representation. Given a driving trajectory, we then render 3D-consistent driving videos via Gaussian splatting. The use of generative priors allows our method to produce high-quality 4D scenes from in-the-wild driving data, while neural rendering ensures 3D-consistent video generation from the 4D scenes. Extensive experiments on nuScenes and street view images demonstrate that DreamDrive can generate controllable and generalizable 4D driving scenes, synthesize novel views of driving videos with high fidelity and 3D consistency, decompose static and dynamic elements in a self-supervised manner, and enhance perception and planning tasks for autonomous driving.
中文摘要:DreamDrive是一种创新的4D场景生成方法,通过融合生成模型与神经渲染技术,能够基于车辆轨迹合成具有三维一致性的逼真驾驶视频,并通过高质量场景生成提升自动驾驶任务的性能。
English Summary: DreamDrive is a novel 4D scene generation method that combines generative models with neural rendering to synthesize realistic driving videos with 3D consistency from vehicle trajectories, enhancing autonomous driving tasks through high-fidelity scene generation.

Authors:Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Title: Tonguescape: Exploring Language Models Understanding of Vowel Articulation
Abstract:
Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and explicit objective observation such as using MRI. With this knowledge and our experience, we can explain and understand the relationship between tongue positions and vowels, and this knowledge is helpful for language learners to learn pronunciation. Since language models (LMs) are trained on a large amount of data that includes linguistic and medical fields, our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from the existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs exhibit potential for understanding vowels and tongue positions when reference examples are provided while they have difficulties without them. Our code for dataset building is available on GitHub.
中文摘要:研究表明,语言模型在提供视觉参考示例时能通过舌位理解元音发音机制,但在无参考时存在困难,该结论基于实时核磁共振数据集构建的实验得出。
English Summary: Language models demonstrate potential in understanding vowel articulation through tongue positions when provided with visual reference examples, though they struggle without such guidance, as shown by a study using real-time MRI datasets.

Authors:Geraldo F. Oliveira, Mayank Kabra, Yuxin Guo, Kangqi Chen, A. Giray Yağlıkçı, Melina Soysal, Mohammad Sadrosadati, Joaquin Olivares Bueno, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu
Title: Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic
Abstract:
Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two's complement with fixed bit-precision, leading to unnecessary computation over useless (i.e., inconsequential) data; (ii) support for only throughput-oriented execution, where the high latency of individual PUD operations can only be hidden in the presence of bulk data-level parallelism; and (iii) high latency for high-precision (e.g., 32-bit) operations. To address these issues, we propose Proteus, the first hardware framework that addresses the high execution latency of bulk bitwise PUD operations by implementing a data-aware runtime engine for PUD. Proteus reduces the latency of PUD operations in three different ways: (i) Proteus dynamically reduces the bit-precision (and thus the latency and energy consumption) of PUD operations by exploiting narrow values (i.e., values with many leading zeros or ones); (ii) Proteus concurrently executes independent in-DRAM primitives belonging to a single PUD operation across multiple DRAM arrays; (iii) Proteus chooses and uses the most appropriate data representation and arithmetic algorithm implementation for a given PUD instruction transparently to the programmer.
Chinese: Proteus是一种创新的硬件框架,通过动态优化位精度、在多个DRAM阵列中并行执行操作,并透明选择高效数据表示,有效降低了DRAM内存计算(PUD)操作的高延迟问题。
English: Proteus is a novel hardware framework that mitigates the high latency of processing-using-DRAM (PUD) operations by dynamically optimizing bit-precision, enabling concurrent execution across DRAM arrays, and selecting efficient data representations transparently.
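The "narrow values" idea in the abstract above can be illustrated with a small software sketch. This is not Proteus's hardware mechanism; it only shows, under the assumption that operands are signed two's complement integers, how few bits a set of values may actually need. The function names are hypothetical.

```python
def min_twos_complement_bits(value: int) -> int:
    """Smallest two's complement width that can represent `value`."""
    # For negative values, ~value (= -value - 1) gives the magnitude whose
    # bit length determines the width; one extra bit holds the sign.
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def required_precision(values) -> int:
    """Width needed so that every operand in `values` fits."""
    return max(min_twos_complement_bits(v) for v in values)


operands = [3, -4, 7]                 # illustrative data, not from the paper
bits = required_precision(operands)   # 4, versus a fixed 32-bit representation
```

In this toy case a bit-serial operation would only need to process 4 of 32 bit positions, which is the kind of saving the abstract attributes to dynamic bit-precision.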

Authors:Kalanit Suzan Segal, Hadar Cochavi Gorelik, Oleg Brodt, Yuval Elbahar, Yuval Elovici, Asaf Shabtai
Title: UEFI Memory Forensics: A Framework for UEFI Threat Analysis
Abstract:
Modern computing systems rely on the Unified Extensible Firmware Interface (UEFI), which has replaced the traditional BIOS as the firmware standard for the modern boot process. Despite the advancements, UEFI is increasingly targeted by threat actors seeking to exploit its execution environment and take advantage of its persistence mechanisms. While some security-related analysis of UEFI components has been performed--primarily via debugging and runtime behavior testing--to the best of our knowledge, no prior study has specifically addressed capturing and analyzing volatile UEFI runtime memory to detect malicious exploitation during the pre-OS phase. This gap in UEFI forensic tools limits the ability to conduct in-depth security analyses in pre-OS environments. Such a gap is especially surprising, given that memory forensics is widely regarded as foundational to modern incident response, reflected by the popularity of above-OS memory analysis frameworks, such as Rekall, Volatility, and MemProcFS. To address the lack of below-OS memory forensics, we introduce a framework for UEFI memory forensics. The proposed framework consists of two primary components: UefiMemDump, a memory acquisition tool, and UEFIDumpAnalysis, an extendable collection of analysis modules capable of detecting malicious activities such as function pointer hooking, inline hooking, and malicious image loading. Our proof-of-concept implementation demonstrates our framework's ability to detect modern UEFI threats, such as ThunderStrike, CosmicStrand, and Glupteba bootkits. By providing an open-source solution, our work enables researchers and practitioners to investigate firmware-level threats, develop additional analysis modules, and advance overall below-OS security through UEFI memory analysis.
中文: 本文提出了一种新型UEFI内存取证框架,通过内存获取和恶意活动检测组件填补了操作系统前安全分析工具的空白,能够有效识别包括引导工具包在内的现代UEFI威胁。
English: This paper introduces a novel framework for UEFI memory forensics to address the lack of pre-OS security analysis tools, featuring memory acquisition and malicious activity detection capabilities that successfully identify modern UEFI threats like bootkits.

Authors:Zifeng Wang, Lang Cao, Qiao Jin, Joey Chan, Nicholas Wan, Behdad Afzali, Hyun-Jin Cho, Chang-In Choi, Mehdi Emamverdi, Manjot K. Gill, Sun-Hyung Kim, Yijia Li, Yi Liu, Hanley Ong, Justin Rousseau, Irfan Sheikh, Jenny J. Wei, Ziyang Xu, Christopher M. Zallek, Kyungsang Kim, Yifan Peng, Zhiyong Lu, Jimeng Sun
Title: A foundation model for human-AI collaboration in medical literature mining
Abstract:
Systematic literature review is essential for evidence-based medicine, requiring comprehensive analysis of clinical trial publications. However, the application of artificial intelligence (AI) models for medical literature mining has been limited by insufficient training and evaluation across broad therapeutic areas and diverse tasks. Here, we present LEADS, an AI foundation model for study search, screening, and data extraction from medical literature. The model is trained on 633,759 instruction data points in LEADSInstruct, curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. We showed that LEADS demonstrates consistent improvements over four cutting-edge generic large language models (LLMs) on six tasks. Furthermore, LEADS enhances expert workflows by providing supportive references following expert requests, streamlining processes while maintaining high-quality results. A study with 16 clinicians and medical researchers from 14 different institutions revealed that experts collaborating with LEADS achieved a recall of 0.81 compared to 0.77 for experts working alone in study selection, with a time savings of 22.6%. In data extraction tasks, experts using LEADS achieved an accuracy of 0.85 versus 0.80 without using LEADS, alongside a 26.9% time savings. These findings highlight the potential of specialized medical literature foundation models to outperform generic models, delivering significant quality and efficiency benefits when integrated into expert workflows for medical literature mining.
中文: LEADS是一种AI基础模型,在医学文献挖掘中优于通用大语言模型,能提升专家在文献筛选和数据提取中的召回率与准确性,并大幅节省时间。
English: LEADS is an AI foundation model that enhances medical literature mining by outperforming generic large language models in study search, screening, and data extraction, improving expert recall and accuracy while saving significant time.

Authors:Jingtong Gao, Zhaocheng Du, Xiaopeng Li, Yichao Wang, Xiangyang Li, Huifeng Guo, Ruiming Tang, Xiangyu Zhao
Title: SampleLLM: Optimizing Tabular Data Synthesis in Recommendations
Abstract:
Tabular data synthesis is crucial in machine learning, yet existing general methods, primarily based on statistical or deep learning models, are highly data-dependent and often fall short in recommender systems. This limitation arises from their difficulty in capturing complex distributions and understanding feature relationships from sparse and limited data, along with their inability to grasp semantic feature relations. Recently, Large Language Models (LLMs) have shown potential in generating synthetic data samples through few-shot learning and semantic understanding. However, they often suffer from inconsistent distribution and lack of diversity due to their inherent distribution disparity with the target dataset. To address these challenges and enhance tabular data synthesis for recommendation tasks, we propose a novel two-stage framework named SampleLLM to improve the quality of LLM-based tabular data synthesis for recommendations by ensuring better distribution alignment. In the first stage, SampleLLM employs LLMs with Chain-of-Thought prompts and diverse exemplars to generate data that closely aligns with the target dataset distribution, even when input samples are limited. The second stage uses an advanced feature attribution-based importance sampling method to refine feature relationships within the synthesized data, reducing any distribution biases introduced by the LLM. Experimental results on three recommendation datasets, two general datasets, and online deployment illustrate that SampleLLM significantly surpasses existing methods for recommendation tasks and holds promise for a broader range of tabular data scenarios.
中文:提出的SampleLLM框架通过结合思维链提示的LLM实现数据分布对齐,并采用基于特征归因的采样优化特征关系,在实验中显著超越了现有推荐任务的表格数据合成方法。
English: The proposed SampleLLM framework enhances tabular data synthesis for recommendations by using LLMs with Chain-of-Thought prompts to align data distribution and employing feature attribution-based sampling to refine feature relationships, outperforming existing methods in experiments.

Authors:Chenyang Ren, Huanyi Xie, Shu Yang, Meng Ding, Lijie Hu, Di Wang
Title: Evaluating Data Influence in Meta Learning
Abstract:
As one of the most fundamental models, meta learning aims to effectively address few-shot learning challenges. However, it still faces significant issues related to the training data, such as training inefficiencies due to numerous low-contribution tasks in large datasets and substantial noise from incorrect labels. Thus, training data attribution methods are needed for meta learning. However, the dual-layer structure of meta learning complicates the modeling of training data contributions because of the interdependent influence between meta-parameters and task-specific parameters, making existing data influence evaluation tools inapplicable or inaccurate. To address these challenges, based on the influence function, we propose a general data attribution evaluation framework for meta-learning within the bilevel optimization framework. Our approach introduces task influence functions (task-IF) and instance influence functions (instance-IF) to accurately assess the impact of specific tasks and individual data points in closed forms. This framework comprehensively models data contributions across both the inner and outer training processes, capturing the direct effects of data points on meta-parameters as well as their indirect influence through task-specific parameters. We also provide several strategies to enhance computational efficiency and scalability. Experimental results demonstrate the framework's effectiveness in training data evaluation via several downstream tasks.
中文:元学习因其双层结构在数据归因上面临挑战,为此提出了基于任务和实例影响函数的框架,以精确评估训练过程中数据的贡献。
English: Meta learning faces challenges in data attribution due to its dual-layer structure, prompting the development of a framework using task and instance influence functions to accurately assess data contributions across training processes.

Authors:Jacob Shams, Ben Nassi, Satoru Koda, Asaf Shabtai, Yuval Elovici
Title: A Privacy Enhancing Technique to Evade Detection by Street Video Cameras Without Using Adversarial Accessories
Abstract:
In this paper, we propose a privacy-enhancing technique leveraging an inherent property of automatic pedestrian detection algorithms, namely, that the training of deep neural network (DNN) based methods is generally performed using curated datasets and laboratory settings, while the operational areas of these methods are dynamic real-world environments. In particular, we leverage a novel side effect of this gap between the laboratory and the real world: location-based weakness in pedestrian detection. We demonstrate that the position (distance, angle, height) of a person, and ambient light level, directly impact the confidence of a pedestrian detector when detecting the person. We then demonstrate that this phenomenon is present in pedestrian detectors observing a stationary scene of pedestrian traffic, with blind spot areas of weak detection of pedestrians with low confidence. We show how privacy-concerned pedestrians can leverage these blind spots to evade detection by constructing a minimum confidence path between two points in a scene, reducing the maximum confidence and average confidence of the path by up to 0.09 and 0.13, respectively, over direct and random paths through the scene. To counter this phenomenon, and force the use of more costly and sophisticated methods to leverage this vulnerability, we propose a novel countermeasure to improve the confidence of pedestrian detectors in blind spots, raising the max/average confidence of paths generated by our technique by 0.09 and 0.05, respectively. In addition, we demonstrate that our countermeasure improves a Faster R-CNN-based pedestrian detector's TPR and average true positive confidence by 0.03 and 0.15, respectively.
中文: 本文提出一种隐私增强技术,利用行人检测系统在真实环境中的位置盲区使行人能够规避检测,同时设计了一种对抗措施来提升检测器在盲区的置信度与可靠性。
English: This paper introduces a privacy technique that exploits location-based blind spots in pedestrian detection systems, allowing individuals to evade detection by following low-confidence paths, while also proposing a countermeasure to enhance detector reliability in these vulnerable areas.
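One way to picture the "minimum confidence path" described above is as a shortest-path search over a map of detector confidences. The sketch below is an illustration under stated assumptions, not the authors' method: it assumes the scene is discretized into a 4-connected grid, that each cell holds a hypothetical detector confidence, and that the path cost is the sum of confidences along the path (Dijkstra's algorithm).

```python
import heapq


def min_confidence_path(conf, start, goal):
    """Return (path, total_confidence) minimizing summed confidence on a grid."""
    rows, cols = len(conf), len(conf[0])
    dist = {start: conf[start[0]][start[1]]}
    prev = {}
    heap = [(dist[start], start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + conf[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk the predecessor links back from the goal to rebuild the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[goal]


# Hypothetical confidence map: the middle column is a high-confidence zone.
grid = [
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1],
]
path, cost = min_confidence_path(grid, (0, 0), (0, 2))
```

Here the direct route crosses a 0.9-confidence cell, so the search detours through the bottom row, where every cell has confidence 0.1. Other objectives, such as minimizing the maximum confidence along the path, would need a different cost definition.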

Authors:Yuhong Sun, Zhangyue Yin, Xuanjing Huang, Xipeng Qiu, Hui Zhao
Title: Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains. Math Word Problems (MWPs) serve as a crucial benchmark for evaluating LLMs' reasoning abilities. While most research primarily focuses on improving accuracy, it often neglects understanding and addressing the underlying patterns of errors. Current error classification methods rely on static and predefined categories, which limit their ability to capture the full spectrum of error patterns in mathematical reasoning. To enable systematic error analysis, we collect error samples from 15 different LLMs of varying sizes across four distinct MWP datasets using multiple sampling strategies. Based on this extensive collection, we introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples that cover diverse error patterns and reasoning paths. To reduce human bias and enable fine-grained analysis of error patterns, we propose a novel framework for automated dynamic error classification in mathematical reasoning. Experimental results demonstrate that dataset characteristics significantly shape error patterns, which evolve from basic to complex manifestations as model capabilities increase. With deeper insights into error patterns, we propose Error-Aware Prompting (EAP) that incorporates common error patterns as explicit guidance, leading to significant improvements in mathematical reasoning performance.
中文摘要:本研究通过收集15种大语言模型的304,865个错误样本构建MWPES-300K数据集,提出动态错误分类框架揭示数据集特征如何影响错误模式演变,最终开发出融入常见错误模式的提示方法显著提升数学推理能力。
English Summary: This study introduces MWPES-300K, a comprehensive dataset of 304,865 error samples from 15 LLMs, and proposes a dynamic error classification framework that reveals how dataset characteristics shape evolving error patterns, ultimately enabling Error-Aware Prompting to significantly enhance mathematical reasoning performance.

Authors:Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici
Title: Tag&Tab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack
Abstract:
Large language models (LLMs) have become essential tools for digital task assistance. Their training relies heavily on the collection of vast amounts of data, which may include copyright-protected or sensitive information. Recent studies on detecting pretraining data in LLMs have primarily focused on sentence- or paragraph-level membership inference attacks (MIAs), usually involving probability analysis of the target model's predicted tokens. However, these methods often exhibit poor accuracy, failing to account for the semantic importance of textual content and word significance. To address these shortcomings, we propose Tag&Tab, a novel approach for detecting data used in LLM pretraining. Our method leverages established natural language processing (NLP) techniques to tag keywords in the input text, a process we term Tagging. Then, the LLM is used to obtain probabilities for these keywords and calculate their average log-likelihood to determine input text membership, a process we refer to as Tabbing. Our experiments on four benchmark datasets (BookMIA, MIMIR, PatentMIA, and the Pile) and several open-source LLMs of varying sizes demonstrate an average increase in AUC scores ranging from 5.3% to 17.6% over state-of-the-art methods. Tag&Tab not only sets a new standard for data leakage detection in LLMs, but its outstanding performance is a testament to the importance of words in MIAs on LLMs.
中文摘要:Tag&Tab方法通过标记关键词并计算其平均对数似然来改进大语言模型的数据泄露检测,相比现有技术实现了显著性能提升。
English Summary: The proposed Tag&Tab method improves data leakage detection in LLMs by tagging keywords and calculating their average log-likelihood, achieving significant performance gains over existing techniques.
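The scoring step described above (the "Tabbing" stage) can be sketched as follows. This is a minimal illustration, assuming the per-token probabilities from the target LLM and the positions of the tagged keywords are already available; the function names, the example numbers, and the decision threshold are hypothetical and not taken from the paper.

```python
import math


def keyword_avg_log_likelihood(token_probs, keyword_positions):
    """Average log-probability over the tagged keyword tokens only."""
    if not keyword_positions:
        raise ValueError("need at least one keyword position")
    total = sum(math.log(token_probs[i]) for i in keyword_positions)
    return total / len(keyword_positions)


def is_member(token_probs, keyword_positions, threshold):
    """Flag the text as likely pretraining data if the score clears the threshold."""
    return keyword_avg_log_likelihood(token_probs, keyword_positions) >= threshold


# Hypothetical probabilities the model assigned to each token of one input text.
probs = [0.9, 0.5, 0.25, 0.8, 0.5]
keywords = [1, 2, 4]  # hypothetical positions of the tagged keywords
score = keyword_avg_log_likelihood(probs, keywords)
```

Only the keyword positions contribute to the score, which is the difference from sentence-level MIAs that average over all tokens. In practice the threshold would be calibrated on held-out member and non-member texts.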

Authors:Dudi Biton, Jacob Shams, Satoru Koda, Asaf Shabtai, Yuval Elovici, Ben Nassi
Title: Towards an End-to-End (E2E) Adversarial Learning and Application in the Physical World
Abstract:
The traditional learning process of patch-based adversarial attacks, conducted in the digital domain and then applied in the physical domain (e.g., via printed stickers), may suffer from reduced performance due to adversarial patches' limited transferability from the digital domain to the physical domain. Given that previous studies have considered using projectors to apply adversarial attacks, we raise the following question: can adversarial learning (i.e., patch generation) be performed entirely in the physical domain with a projector? In this work, we propose the Physical-domain Adversarial Patch Learning Augmentation (PAPLA) framework, a novel end-to-end (E2E) framework that converts adversarial learning from the digital domain to the physical domain using a projector. We evaluate PAPLA across multiple scenarios, including controlled laboratory settings and realistic outdoor environments, demonstrating its ability to ensure attack success compared to conventional digital learning-physical application (DL-PA) methods. We also analyze the impact of environmental factors, such as projection surface color, projector strength, ambient light, distance, and angle of the target object relative to the camera, on the effectiveness of projected patches. Finally, we demonstrate the feasibility of the attack against a parked car and a stop sign in a real-world outdoor environment. Our results show that under specific conditions, E2E adversarial learning in the physical domain eliminates the transferability issue and ensures evasion by object detectors. Finally, we provide insights into the challenges and opportunities of applying adversarial learning in the physical domain and explain where such an approach is more effective than using a sticker.
中文: 提出的PAPLA框架通过投影仪在物理域中实现端到端的对抗性补丁学习,有效克服了传统数字方法的可迁移性问题,并在多种现实场景中确保了攻击的成功。
English: The proposed PAPLA framework enables end-to-end adversarial patch learning directly in the physical domain using a projector, effectively overcoming the transferability issues of traditional digital methods and ensuring successful attacks across various real-world scenarios.

Authors:Chen Tang, Bo Lv, Zifan Zheng, Bohao Yang, Kun Zhao, Ning Liao, Xiaoxing Wang, Feiyu Xiong, Zhiyu Li, Nayu Liu, Jingchi Jiang
Title: GRAPHMOE: Amplifying Cognitive Depth of Mixture-of-Experts Network via Introducing Self-Rethinking Mechanism
Abstract:
Traditional Mixture-of-Experts (MoE) networks benefit from utilizing multiple smaller expert models as opposed to a single large network. However, these experts typically operate independently, leaving a question open about whether interconnecting these models could enhance the performance of MoE networks. In response, we introduce GRAPHMOE, a novel method aimed at augmenting the cognitive depth of language models via a self-rethinking mechanism constructed on Pseudo GraphMoE networks. GRAPHMOE employs a recurrent routing strategy to simulate iterative thinking steps, thereby facilitating the flow of information among expert nodes. We implement the GRAPHMOE architecture using Low-Rank Adaptation techniques (LoRA) and conduct extensive experiments on various benchmark datasets. The experimental results reveal that GRAPHMOE outperforms other LoRA based models, achieving state-of-the-art (SOTA) performance. Additionally, this study explores a novel recurrent routing strategy that may inspire further advancements in enhancing the reasoning capabilities of language models.
中文摘要:GRAPHMOE通过伪图网络和循环路由策略构建自反思机制,增强专家节点间的信息交互,在使用LoRA技术时实现了最优性能,并为提升语言模型推理能力开辟了新途径。
English Summary: GRAPHMOE introduces a self-rethinking mechanism through pseudo graph networks and recurrent routing to enhance information flow between experts, achieving state-of-the-art performance with LoRA adaptation while advancing language model reasoning capabilities.

Authors:Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov
Title: Multi-subject Open-set Personalization in Video Generation
Abstract:
Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
中文摘要:Video Alchemist 是一种创新的视频个性化模型,通过扩散Transformer架构和创新的数据流程,实现了无需测试时优化的多主体开放集视频定制,有效解决了数据集构建和评估难题。
English Summary: Video Alchemist is a novel video personalization model that enables multi-subject, open-set customization of both foreground and background elements without test-time optimization, utilizing a Diffusion Transformer architecture and an innovative data pipeline to overcome dataset and evaluation challenges.

Authors:Ding Zhang, Yangning Li, Lichen Bai, Hao Zhang, Yinghui Li, Haiye Lin, Hai-Tao Zheng, Xin Su, Zifei Shan
Title: Loss-Aware Curriculum Learning for Chinese Grammatical Error Correction
Abstract:
Chinese grammatical error correction (CGEC) aims to detect and correct errors in the input Chinese sentences. Recently, Pre-trained Language Models (PLMs) have been employed to improve the performance. However, current approaches ignore that correction difficulty varies across different instances and treat these samples equally, which makes model learning harder. To address this problem, we propose a multi-granularity Curriculum Learning (CL) framework. Specifically, we first calculate the correction difficulty of these samples and feed them into the model from easy to hard batch by batch. Then Instance-Level CL is employed to help the model optimize in the appropriate direction automatically by regulating the loss function. Extensive experimental results and comprehensive analyses of various datasets prove the effectiveness of our method.
中文摘要:本研究提出了一种多粒度课程学习框架,通过从易到难分批训练中文语法纠错模型,并采用实例级课程学习自动调节损失函数,有效提升了纠错性能。
English Summary: This study introduces a multi-granularity curriculum learning framework that progressively trains Chinese grammatical error correction models from easy to difficult instances while automatically optimizing learning direction through instance-level adjustments.
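The easy-to-hard batching step described above can be sketched in a few lines. This is an illustrative sketch only: the abstract does not say how correction difficulty is computed, so the difficulty scores here are hypothetical placeholders supplied with each sample, and `curriculum_batches` is a hypothetical helper name.

```python
def curriculum_batches(samples, difficulty, batch_size):
    """Order samples from easy to hard, then split them into batches."""
    ordered = sorted(samples, key=difficulty)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Each sample is (sentence_id, difficulty_score); the scores are made up.
samples = [("a", 3), ("b", 1), ("c", 2), ("d", 5), ("e", 4)]
batches = curriculum_batches(samples, difficulty=lambda s: s[1], batch_size=2)
```

The model would then see `batches` in order, starting with the lowest-difficulty samples. The instance-level stage, which regulates the loss function, is not covered by this sketch.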

Authors:Hebin Wang, Yangning Li, Yinghui Li, Hai-Tao Zheng, Wenhao Jiang, Hong-Gee Kim
Title: Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
Abstract:
The rapid development of multimodal large language models (MLLMs) has brought significant improvements to a wide range of tasks in real-world applications. However, LLMs still exhibit certain limitations in extracting implicit semantic information. In this paper, we apply MLLMs to the Multi-modal Entity Set Expansion (MESE) task, which aims to expand a handful of seed entities with new entities belonging to the same semantic class, and multi-modal information is provided with each entity. We explore the capabilities of MLLMs to understand implicit semantic information at the entity-level granularity through the MESE task, introducing a listwise ranking method LUSAR that maps local scores to global rankings. Our LUSAR demonstrates significant improvements in MLLM's performance on the MESE task, marking the first use of generative MLLM for ESE tasks and extending the applicability of listwise ranking.
中文: 本文提出LUSAR列表排序方法,通过多模态实体集扩展任务增强多模态大语言模型对隐含语义信息的提取能力,显著提升了模型性能。
English: This paper introduces LUSAR, a listwise ranking method that enhances multimodal large language models' ability to extract implicit semantic information for the Multi-modal Entity Set Expansion task, achieving significant performance improvements.

Authors:Xiang Wu, Xunkai Li, Rong-Hua Li, Kangfei Zhao, Guoren Wang
Title: ScaDyG: A New Paradigm for Large-scale Dynamic Graph Learning
Abstract:
Dynamic graphs (DGs), which capture time-evolving relationships between graph entities, have widespread real-world applications. To efficiently encode DGs for downstream tasks, most dynamic graph neural networks follow the traditional message-passing mechanism and extend it with time-based techniques. Despite their effectiveness, the growth of historical interactions introduces significant scalability issues, particularly in industry scenarios. To address this limitation, we propose ScaDyG, with the core idea of designing a time-aware scalable learning paradigm as follows: 1) Time-aware Topology Reformulation: ScaDyG first segments historical interactions into time steps (intra and inter) based on dynamic modeling, enabling weight-free and time-aware graph propagation within pre-processing. 2) Dynamic Temporal Encoding: To further achieve fine-grained graph propagation within time steps, ScaDyG integrates temporal encoding through a combination of exponential functions in a scalable manner. 3) Hypernetwork-driven Message Aggregation: After obtaining the propagated features (i.e., messages), ScaDyG utilizes hypernetwork to analyze historical dependencies, implementing node-wise representation by an adaptive temporal fusion. Extensive experiments on 12 datasets demonstrate that ScaDyG performs comparably well or even outperforms other SOTA methods in both node and link-level downstream tasks, with fewer learnable parameters and higher efficiency.
中文:ScaDyG提出了一种可扩展的动态图学习范式,通过时间感知的拓扑重构和动态时序编码分割历史交互,并利用超网络驱动消息聚合,以更少参数量在下游任务中实现更高效率与性能。
English: ScaDyG introduces a scalable dynamic graph learning paradigm that segments historical interactions into time-aware steps and employs hypernetwork-driven message aggregation, achieving superior efficiency and performance in downstream tasks with fewer parameters.
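
As a rough illustration of the weight-free, time-aware propagation idea in the abstract above, the sketch below aggregates a node's past interactions with weights drawn from a mix of exponential decays. The function name, the decay rates, and the normalization are assumptions made for illustration; the abstract does not specify ScaDyG's actual encoding.

```python
import math

def time_decay_aggregate(events, query_time, decay_rates=(0.1, 1.0)):
    """Aggregate neighbor features using a mixture of exponential time decays.

    events: list of (timestamp, feature_vector) pairs for one node.
    decay_rates: hypothetical rates; each contributes exp(-rate * age).
    No learnable weights are involved, so this can run in pre-processing.
    """
    if not events:
        return []
    dim = len(events[0][1])
    total = [0.0] * dim
    weight_sum = 0.0
    for timestamp, feature in events:
        age = query_time - timestamp
        if age < 0:
            continue  # skip interactions that happen after the query time
        # Average several exponentials so both recent and older events count.
        weight = sum(math.exp(-r * age) for r in decay_rates) / len(decay_rates)
        weight_sum += weight
        for i, value in enumerate(feature):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]

# Usage: the more recent interaction (t=9) dominates the aggregate at t=10.
aggregated = time_decay_aggregate([(0.0, [1.0, 0.0]), (9.0, [0.0, 1.0])], 10.0)
```

Because the weights depend only on timestamps, the aggregation needs no training, which is the property the abstract attributes to the pre-processing stage.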

Authors:Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos
Title: Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics
Abstract:
Fiducial markers are widely used in various robotics tasks, facilitating enhanced navigation, object recognition, and scene understanding. Despite their advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments because they are visible to humans, making them unsuitable for non-intrusive use cases. To address this gap, this paper presents "iMarkers": innovative, unobtrusive fiducial markers detectable exclusively by robots equipped with specialized sensors. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Various evaluations have demonstrated the effectiveness of iMarkers compared to conventional (printed) and blended fiducial markers and confirmed their applicability in diverse robotics scenarios.
中文: 本文提出“iMarkers”隐形基准标记,仅可通过机器人专用传感器检测,具备可定制可见范围与编码算法,在多种机器人场景中实现了高效稳定的检测与识别性能。
English: This paper introduces "iMarkers," unobtrusive fiducial markers detectable only by specialized robot sensors, offering customizable visibility and encoding while maintaining robust detection and recognition capabilities across various robotics applications.

Authors:Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia
Title: An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities
Abstract:
Code generation aims to automatically generate code snippets in a specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code generation task to achieve remarkable performance. One main challenge of pre-trained models for code generation is the semantic gap between natural language requirements and source code. To address the issue, prior studies typically adopt a retrieval-augmented framework for the task, where similar code snippets collected by a retrieval process can be leveraged to help understand the requirements and provide guidance for the generation process. However, there is a lack of systematic study on the application of this framework for code generation, including its impact on the final generated results and the specific usage of the framework. In this paper, we choose three popular pre-trained code models, namely CodeGen, UniXcoder, and CodeT5, to assess the impact of the quality and utilization of retrieved code on the retrieval-augmented framework. Our analysis shows that the retrieval-augmented framework is beneficial for improving the performance of the existing pre-trained models. We also provide suggestions on the utilization of the retrieval-augmented code generation framework: BM25 and Sequential Integration Fusion are recommended due to their convenience and superior performance. Sketch Filling Fusion, which extracts a sketch of relevant code, could help the model improve its performance further. Additionally, we conduct experiments to investigate the influence of the retrieval-augmented framework on large language models for code generation, showing the effectiveness of the framework, and we discuss the trade-off between performance improvement and computational costs in each phase within the framework.
中文: 检索增强框架通过利用相似代码片段显著提升代码生成性能,其中BM25和顺序集成融合因其便捷性和优越表现被推荐使用。
English: The retrieval-augmented framework significantly enhances code generation performance by leveraging similar code snippets, with BM25 and Sequential Integration Fusion recommended for optimal results.
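
The retrieval step recommended above can be sketched with a small, self-contained BM25 scorer. This is a minimal sketch under stated assumptions: the tokenizer, the parameter defaults (`k1`, `b`), and the helper names are illustrative, and "Sequential Integration Fusion" is interpreted here as simply concatenating the retrieved snippets in front of the query, which is an assumption about the paper's method rather than its confirmed definition.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count documents, not occurrences
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed, non-negative IDF variant.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

def build_prompt(query, docs, top_k=1):
    """Prepend the top-k retrieved snippets to the query (assumed fusion)."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    retrieved = [docs[i] for i in ranked[:top_k]]
    return "\n".join(retrieved + [query])
```

The resulting prompt string would then be passed to whichever code model is being evaluated; that call is omitted because it depends on the model's own API.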

Authors:Zhengyu Wu, Guang Zeng, Huilin Lai, Daohan Su, Jishuo Jia, Yinlin Zhu, Xunkai Li, Rong-Hua Li, Guoren Wang, Chenghu Zhou
Title: Knowledge-Driven Federated Graph Learning on Model Heterogeneity
Abstract:
Federated graph learning (FGL) has emerged as a promising paradigm for collaborative graph representation learning, enabling multiple parties to jointly train models while preserving data privacy. However, most existing approaches assume homogeneous client models and largely overlook the challenge of model-centric heterogeneous FGL (MHtFGL), which frequently arises in practice when organizations employ graph neural networks (GNNs) of different scales and architectures. Such architectural diversity not only undermines smooth server-side aggregation, which presupposes a unified representation space shared across clients' updates, but also further complicates the transfer and integration of structural knowledge across clients. To address this issue, we propose the Federated Graph Knowledge Collaboration (FedGKC) framework. FedGKC introduces a lightweight Copilot Model on each client to facilitate knowledge exchange while local architectures are heterogeneous across clients, and employs two complementary mechanisms: Client-side Self-Mutual Knowledge Distillation, which transfers effective knowledge between local and copilot models through bidirectional distillation with multi-view perturbation; and Server-side Knowledge-Aware Model Aggregation, which dynamically assigns aggregation weights based on knowledge provided by clients. Extensive experiments on eight benchmark datasets demonstrate that FedGKC achieves an average accuracy gain of 3.74% over baselines in MHtFGL scenarios, while maintaining excellent performance in homogeneous settings.
中文: 联邦图知识协作框架通过客户端协同模型的双向知识蒸馏与服务器端知识感知聚合,解决了模型异构的联邦图学习难题,在基准测试中平均准确率比基线方法提升3.74%。
English: Federated Graph Knowledge Collaboration (FedGKC) addresses model-centric heterogeneous federated graph learning by introducing client-side copilot models for bidirectional knowledge distillation and server-side knowledge-aware aggregation, achieving 3.74% average accuracy improvement over baselines.
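
The bidirectional distillation between a local model and its copilot can be illustrated with a generic mutual-distillation loss. This is a sketch only: the temperature value, the function names, and the use of plain KL divergence on softened logits are assumptions, and the multi-view perturbation mentioned in the abstract is not modeled here.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_distillation_losses(local_logits, copilot_logits, temperature=2.0):
    """Return (loss for the local model, loss for the copilot model).

    Each model treats the other's softened prediction as the teacher,
    so knowledge flows in both directions.
    """
    p_local = softmax(local_logits, temperature)
    p_copilot = softmax(copilot_logits, temperature)
    loss_local = kl_divergence(p_copilot, p_local)    # copilot teaches local
    loss_copilot = kl_divergence(p_local, p_copilot)  # local teaches copilot
    return loss_local, loss_copilot
```

In a training loop each loss would typically be added to that model's own supervised loss; when the two models already agree, both distillation terms vanish.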

Authors:Xunkai Li, Bowen Fan, Zhengyu Wu, Zhiyu Li, Rong-Hua Li, Guoren Wang
Title: Toward Scalable Graph Unlearning: A Node Influence Maximization based Approach
Abstract:
Machine unlearning, as a pivotal technology for enhancing model robustness and data privacy, has garnered significant attention in prevalent web mining applications, especially in thriving graph-based scenarios. However, most existing graph unlearning (GU) approaches face significant challenges due to the intricate interactions among web-scale graph elements during the model training: (1) The gradient-driven node entanglement hinders the complete knowledge removal in response to unlearning requests; (2) The billion-level graph elements in the web scenarios present inevitable scalability issues. To break the above limitations, we open up a new perspective by drawing a connection between GU and conventional social influence maximization. To this end, we propose Node Influence Maximization (NIM) through the decoupled influence propagation model and fine-grained influence function in a scalable manner, which is crafted to be a plug-and-play strategy to identify potential nodes affected by unlearning entities. This approach enables offline execution independent of GU, allowing it to be seamlessly integrated into most GU methods to improve their unlearning performance. Based on this, we introduce Scalable Graph Unlearning (SGU) as a new fine-tuned framework, which balances the forgetting and reasoning capability of the unlearned model by entity-specific optimizations. Extensive experiments on 14 datasets, including large-scale ogbn-papers100M, have demonstrated the effectiveness of our approach. Specifically, NIM enhances the forgetting capability of most GU methods, while SGU achieves comprehensive SOTA performance and maintains scalability.
中文摘要:针对图数据机器遗忘中节点纠缠与可扩展性难题,本研究通过节点影响力最大化(NIM)识别待遗忘实体相关节点,并构建可扩展图遗忘框架(SGU),在保持模型推理能力的同时实现高效知识擦除。
English Summary: Machine unlearning in graph-based web applications faces challenges with node entanglement and scalability, which the proposed Node Influence Maximization (NIM) and Scalable Graph Unlearning (SGU) framework address by identifying affected nodes and balancing model capabilities through entity-specific optimizations.

Authors:Xunkai Li, Daohan Su, Zhengyu Wu, Guang Zeng, Hongchao Qin, Rong-Hua Li, Guoren Wang
Title: Toward Effective Digraph Representation Learning: A Magnetic Adaptive Propagation based Approach
Abstract:
The $q$-parameterized magnetic Laplacian serves as the foundation of directed graph (digraph) convolution, enabling this kind of digraph neural network (MagDG) to encode node features and structural insights by complex-domain message passing. As a generalization of undirected methods, MagDG shows superior capability in modeling intricate web-scale topology. Despite the great success achieved by existing MagDGs, limitations still exist: (1) Hand-crafted $q$: The performance of MagDGs depends on selecting an appropriate $q$-parameter to construct suitable graph propagation equations in the complex domain. This parameter tuning, driven by downstream tasks, limits model flexibility and significantly increases manual effort. (2) Coarse Message Passing: Most approaches treat all nodes with the same complex-domain propagation and aggregation rules, neglecting their unique digraph contexts. This oversight results in sub-optimal performance. To address the above issues, we propose two key techniques: (1) MAP is crafted to be a plug-and-play complex-domain propagation optimization strategy in the context of digraph learning, enabling seamless integration into any MagDG to improve predictions while enjoying high running efficiency. (2) MAP++ is a new digraph learning framework, further incorporating a learnable mechanism to achieve adaptive edge-wise propagation and node-wise aggregation in the complex domain for better performance. Extensive experiments on 12 datasets demonstrate that MAP is flexible, as it can be incorporated into any MagDG, and scalable, as it can handle web-scale digraphs. MAP++ achieves SOTA predictive performance on 4 different downstream tasks.
中文: $q$参数化磁拉普拉斯算子虽能支持有向图神经网络建模复杂拓扑,但存在手动调参和粗粒度消息传递的局限,为此提出的MAP优化策略和MAP++框架实现了自适应复数域传播以提升性能。
English: The $q$-parameterized magnetic Laplacian enables directed graph neural networks (MagDG) to model complex topologies but faces limitations in manual parameter tuning and uniform message passing, which are addressed by the proposed MAP optimization strategy and MAP++ framework for adaptive complex-domain propagation.
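
For readers unfamiliar with the operator, the sketch below builds the unnormalized $q$-parameterized magnetic Laplacian in its commonly used form, $L^{(q)} = D_s - A_s \odot \exp(i\,2\pi q\,(A - A^\top))$ with $A_s = (A + A^\top)/2$. This is the standard definition from the magnetic Laplacian literature, assumed here because the abstract does not spell it out; it is not the MAP or MAP++ method itself, and the function name is illustrative.

```python
import cmath
import math

def magnetic_laplacian(adj, q):
    """Unnormalized magnetic Laplacian of a digraph.

    adj: square adjacency matrix as nested lists (adj[u][v] = weight of u -> v).
    q:   charge parameter; q = 0 recovers the symmetrized real Laplacian.
    """
    n = len(adj)
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        degree = 0.0
        for v in range(n):
            sym = 0.5 * (adj[u][v] + adj[v][u])                   # A_s entry
            theta = 2.0 * math.pi * q * (adj[u][v] - adj[v][u])   # phase
            degree += sym
            lap[u][v] = -sym * cmath.exp(1j * theta)              # -H entry
        lap[u][u] += degree                                       # add D_s
    return lap

# Usage: a directed path 0 -> 1 -> 2 with q = 0.25.
path = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
L = magnetic_laplacian(path, 0.25)
```

Edge direction is stored in the complex phase, so the matrix stays Hermitian for any `q`; choosing `q` by hand is the tuning burden the abstract describes.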

Authors:Xiaohui Li, Yihao Liu, Shuo Cao, Ziyan Chen, Shaobin Zhuang, Xiangyu Chen, Yinan He, Yi Wang, Yu Qiao
Title: DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
Abstract:
Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos--precisely where diffusion models' generative capabilities are most needed. We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that addressing the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.
中文摘要:本研究提出的DiffVSR框架通过渐进式学习策略和交织潜在转换技术,系统化解除了扩散模型在视频超分辨率中的学习负担,在严重退化视频上实现了卓越性能,揭示了学习策略优化比架构复杂性更关键的本质。
English Summary: The study introduces DiffVSR, a novel diffusion-based video super-resolution framework that overcomes existing limitations through a Progressive Learning Strategy and Interweaved Latent Transition technique, achieving superior performance on severely degraded videos by systematically addressing the learning burden.

Authors:Shiu-hong Kao, Xiao Li, Jinglu Wang, Yang Li, Chi-Keung Tang, Yu-Wing Tai, Yan Lu
Title: UVRM: A Scalable 3D Reconstruction Model from Unposed Videos
Abstract:
Large Reconstruction Models (LRMs) have recently become a popular method for creating 3D foundational models. Training 3D reconstruction models with 2D visual data traditionally requires prior knowledge of camera poses for the training samples, a process that is both time-consuming and prone to errors. Consequently, 3D reconstruction training has been confined to either synthetic 3D datasets or small-scale datasets with annotated poses. In this study, we investigate the feasibility of 3D reconstruction using unposed video data of various objects. We introduce UVRM, a novel 3D reconstruction model capable of being trained and evaluated on monocular videos without requiring any information about the pose. UVRM uses a transformer network to implicitly aggregate video frames into a pose-invariant latent feature space, which is then decoded into a tri-plane 3D representation. To obviate the need for ground-truth pose annotations during training, UVRM employs a combination of the score distillation sampling (SDS) method and an analysis-by-synthesis approach, progressively synthesizing pseudo novel-views using a pre-trained diffusion model. We qualitatively and quantitatively evaluate UVRM's performance on the G-Objaverse and CO3D datasets without relying on pose information. Extensive experiments show that UVRM is capable of effectively and efficiently reconstructing a wide range of 3D objects from unposed videos.
中文摘要:本研究提出了UVRM模型,通过基于变换器的特征聚合和结合分数蒸馏采样的分析合成方法,能够在无需相机姿态标注的单目视频上实现高效的三维重建,有效扩展了三维重建的应用范围。
English Summary: The study introduces UVRM, a novel 3D reconstruction model that eliminates the need for camera pose annotations by training on unposed monocular videos through transformer-based feature aggregation and a combination of score distillation sampling with analysis-by-synthesis techniques.

Authors:Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu
Title: Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
Abstract:
We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.
中文: Vchitect-2.0采用并行Transformer架构,通过多模态扩散块确保文本与视频帧的对齐和时序连贯性,结合内存高效训练框架提升长序列处理能力,并利用增强数据管道构建高质量数据集,在文本生成视频任务中实现了卓越的视频质量与训练效率。
English: Vchitect-2.0 introduces a parallel transformer architecture with a Multimodal Diffusion Block for text-video alignment and temporal coherence, a Memory-efficient Training framework for scalable long-sequence processing, and an enhanced data pipeline, achieving superior video quality and efficiency in text-to-video generation.

Authors:Phai Vu Dinh, Diep N. Nguyen, Dinh Thai Hoang, Quang Uy Nguyen, Eryk Dutkiewicz
Title: Multiple-Input Variational Auto-Encoder for Anomaly Detection in Heterogeneous Data
Abstract:
Anomaly detection (AD) plays a pivotal role in AI applications, e.g., in classification, and intrusion/threat detection in cybersecurity. However, most existing methods face challenges of heterogeneity amongst feature subsets posed by non-independent and identically distributed (non-IID) data. We propose a novel neural network model called Multiple-Input Auto-Encoder for AD (MIAEAD) to address this. MIAEAD assigns an anomaly score to each feature subset of a data sample to indicate its likelihood of being an anomaly. This is done by using the reconstruction error of its sub-encoder as the anomaly score. All sub-encoders are then simultaneously trained using unsupervised learning to determine the anomaly scores of feature subsets. The final AUC of MIAEAD is calculated for each sub-dataset, and the maximum AUC obtained among the sub-datasets is selected. To leverage the ability of generative models to model the distribution of normal data when identifying anomalies, we develop a novel neural network architecture/model called Multiple-Input Variational Auto-Encoder (MIVAE). MIVAE can process feature subsets through its sub-encoders before learning the distribution of normal data in the latent space. This allows MIVAE to identify anomalies that deviate from the learned distribution. We theoretically prove that the difference in the average anomaly score between normal samples and anomalies obtained by the proposed MIVAE is greater than that of the Variational Auto-Encoder (VAEAD), resulting in a higher AUC for MIVAE. Extensive experiments on eight real-world anomaly datasets demonstrate the superior performance of MIAEAD and MIVAE over conventional methods and the state-of-the-art unsupervised models, by up to 6% in terms of AUC score. In addition, MIAEAD and MIVAE have a high AUC when applied to feature subsets with low heterogeneity based on the coefficient of variation (CV) score.
Chinese: 作者提出了MIAEAD和MIVAE两种新型神经网络模型,通过为特征子集分配异常分数来解决非独立同分布数据中的异质性问题,在异常检测中比现有方法的AUC最高提升6%。
English: The authors propose two novel neural network models, MIAEAD and MIVAE, to address heterogeneity in non-IID data for anomaly detection by assigning anomaly scores to feature subsets, achieving up to 6% higher AUC than existing methods.
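
The scoring procedure described in the abstract (reconstruction error as the anomaly score, one AUC per feature subset, then the maximum) can be sketched as follows. The reconstructions are taken as given inputs, so no autoencoder is trained here; the function names and the toy error values are hypothetical and are not results from the paper.

```python
def reconstruction_error(x, x_hat):
    """Mean squared error between a sample and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def auc(scores, labels):
    """AUC as the probability that an anomaly outscores a normal sample.

    labels: 1 for anomaly, 0 for normal. Ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def best_subset_auc(errors_by_subset, labels):
    """Compute one AUC per feature subset and return the best (name, auc)."""
    aucs = {name: auc(errs, labels) for name, errs in errors_by_subset.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs[best]

# Usage with made-up per-sample errors for two hypothetical feature subsets.
labels = [0, 0, 1, 1]
errors = {
    "subset_a": [0.1, 0.2, 0.9, 0.8],
    "subset_b": [0.5, 0.1, 0.4, 0.2],
}
best_name, best_auc = best_subset_auc(errors, labels)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; a rank-based computation would be preferable on large datasets.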

Authors:Siran Chen, Yuxiao Luo, Yue Ma, Yu Qiao, Yali Wang
Title: H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
Abstract:
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules, Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context for different temporal resolutions. Second, Q-Mamba flexibly transforms the current frame into a learnable query and attentively selects multi-granularity video context into the query. Consequently, it can adaptively integrate all the video contexts of multi-scale temporal resolutions to enhance video understanding. Via a plug-and-play paradigm in MLLMs, our H-MBA shows remarkable performance on multi-modal video tasks in autonomous driving; e.g., for risk object detection, it outperforms the previous SOTA method by 5.5% mIoU.
中文: 提出的分层Mamba适应(H-MBA)框架通过上下文Mamba和查询Mamba模块,有效解决了现有多模态大语言模型在自动驾驶视频中理解复杂时空运动的局限性,实现了显著的性能提升。
English: The proposed Hierarchical Mamba Adaptation (H-MBA) framework addresses the limitations of existing Multimodal Large Language Models in understanding complex spatial-temporal movements in autonomous driving videos, achieving significant performance improvements through its Context Mamba and Query Mamba modules.

Authors:Bowen Fan, Yuming Ai, Xunkai Li, Zhilin Guo, Rong-Hua Li, Guoren Wang
Title: OpenGU: A Comprehensive Benchmark for Graph Unlearning
Abstract:
Graph Machine Learning is essential for understanding and analyzing relational data. However, privacy-sensitive applications demand the ability to efficiently remove sensitive information from trained graph neural networks (GNNs), avoiding the unnecessary time and space overhead caused by retraining models from scratch. To address this issue, Graph Unlearning (GU) has emerged as a critical solution, with the potential to support dynamic graph updates in data management systems and enable scalable unlearning in distributed data systems while ensuring privacy compliance. Unlike machine unlearning in computer vision or other fields, GU faces unique difficulties due to the non-Euclidean nature of graph data and the recursive message-passing mechanism of GNNs. Additionally, the diversity of downstream tasks and the complexity of unlearning requests further amplify these challenges. Despite the proliferation of diverse GU strategies, the absence of a benchmark providing fair comparisons for GU, and the limited flexibility in combining downstream tasks and unlearning requests, have yielded inconsistencies in evaluations, hindering the development of this domain. To fill this gap, we present OpenGU, the first GU benchmark, where 16 SOTA GU algorithms and 37 multi-domain datasets are integrated, enabling various downstream tasks with 13 GNN backbones when responding to flexible unlearning requests. Based on this unified benchmark framework, we are able to provide a comprehensive and fair evaluation for GU. Through extensive experimentation, we have drawn 8 crucial conclusions about existing GU methods, while also gaining valuable insights into their limitations, shedding light on potential avenues for future research.
中文摘要:图机器学习中的遗忘学习(GU)旨在高效移除图神经网络中的敏感数据而无需重新训练,而OpenGU基准的提出为公平评估GU方法提供了统一框架,通过大量实验揭示了关键结论和未来研究方向。
English Summary: Graph Unlearning (GU) addresses the need to efficiently remove sensitive data from trained graph neural networks without full retraining, and the OpenGU benchmark is introduced to provide a unified framework for fair evaluation of GU methods, revealing key insights through extensive experiments.

Authors:Teng Li, Xingjun Ma, Yu-Gang Jiang
Title: AIM: Additional Image Guided Generation of Transferable Adversarial Attacks
Abstract:
Transferable adversarial examples highlight the vulnerability of deep neural networks (DNNs) to imperceptible perturbations across various real-world applications. While there have been notable advancements in untargeted transferable attacks, targeted transferable attacks remain a significant challenge. In this work, we focus on generative approaches for targeted transferable attacks. Current generative attacks focus on reducing overfitting to surrogate models and the source data domain, but they often overlook the importance of enhancing transferability through additional semantics. To address this issue, we introduce a novel plug-and-play module into the general generator architecture to enhance adversarial transferability. Specifically, we propose a Semantic Injection Module (SIM) that utilizes the semantics contained in an additional guiding image to improve transferability. The guiding image provides a simple yet effective method to incorporate target semantics from the target class to create targeted and highly transferable attacks. Additionally, we propose new loss formulations that can integrate the semantic injection module more effectively for both targeted and untargeted attacks. We conduct comprehensive experiments under both targeted and untargeted attack settings to demonstrate the efficacy of our proposed approach.
中文: 本文提出了一种语义注入模块(SIM),通过引入引导图像中的目标语义来增强对抗样本的迁移性,并设计了新的损失函数以改进现有生成式攻击在定向和非定向场景中的效果。
English: This paper introduces a Semantic Injection Module (SIM) that enhances adversarial transferability by incorporating target semantics from guiding images, addressing limitations in current generative attacks through novel loss formulations for both targeted and untargeted settings.

Authors:Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, Limin Wang
Title: VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Abstract:
Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and so on. Despite recent advances, handling long videos remains challenging due to the difficulty of efficiently understanding extremely long video context. This paper aims to address this issue from the aspects of model architecture, training data, training strategy, and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which leverages visual redundancy in long videos to compress long video context from Clip-level to Video-level, reducing the computation significantly while preserving essential details, achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long videos named LongVid, and a challenging "Multi-Hop Needle-In-A-Video-Haystack" benchmark. Finally, we build a powerful video MLLM named VideoChat-Flash, which shows leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scales. It is the first open-source model to reach 99.1% accuracy over 10,000 frames in NIAH.
中文: 本文提出VideoChat-Flash多模态大模型,通过分层视频令牌压缩方法和多阶段学习方案解决长视频处理难题,在极低压缩损失下实现主流视频基准测试领先性能。
English: This paper introduces VideoChat-Flash, a multimodal large language model that tackles long-context video challenges through a Hierarchical video token Compression method and a multi-stage training scheme, achieving top performance on benchmarks with minimal performance loss at extreme compression.

Authors:Ryan McKenna, Yangsibo Huang, Amer Sinha, Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Badih Ghazi, George Kaissis, Ravi Kumar, Ruibo Liu, Da Yu, Chiyuan Zhang
Title: Scaling Laws for Differentially Private Language Models
Abstract:
Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility tradeoffs and the optimal training configurations in many settings.
中文: 本研究建立了能够准确模拟差分隐私大语言模型训练的扩展定律,全面揭示了计算能力、隐私保护与模型效用之间的权衡关系及多种场景下的最优训练配置。
English: This study establishes scaling laws that accurately model differentially private large language model training, providing a comprehensive understanding of compute-privacy-utility tradeoffs and optimal configurations.

Authors:Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dongjie Wang, Yanjie Fu, Kunpeng Liu
Title: LEKA: LLM-Enhanced Knowledge Augmentation
Abstract:
Humans excel in analogical learning and knowledge transfer and, more importantly, possess a unique understanding of identifying appropriate sources of knowledge. From a model's perspective, this presents an interesting challenge. If models could autonomously retrieve knowledge useful for transfer or decision-making to solve problems, they would transition from passively acquiring to actively accessing and learning from knowledge. However, filling models with knowledge is relatively straightforward -- it simply requires more training and accessible knowledge bases. The more complex task is teaching models about which knowledge can be analogized and transferred. Therefore, we design a knowledge augmentation method, LEKA, for knowledge transfer that actively searches for suitable knowledge sources that can enrich the target domain's knowledge. This LEKA method extracts key information from the target domain's textual information, retrieves pertinent data from external data libraries, and harmonizes retrieved data with the target domain data in feature space and marginal probability measures. We validate the effectiveness of our approach through extensive experiments across various domains and demonstrate significant improvements over traditional methods in reducing computational costs, automating data alignment, and optimizing transfer learning outcomes.
中文: LEKA方法通过主动搜索和整合相关知识源来增强目标领域的学习,相比传统方法在迁移学习中实现了更高的效率和自动化水平。
English: The LEKA method actively identifies and retrieves relevant knowledge sources to enhance target domain learning, achieving superior efficiency and automation in transfer learning compared to conventional approaches.

Authors:Zhengpeng Xie, Jiahang Cao, Yulong Zhang, Qiang Zhang, Renjing Xu
Title: A Dual-Agent Adversarial Framework for Robust Generalization in Deep Reinforcement Learning
Abstract:
Recently, empowered by the capabilities of neural networks, reinforcement learning (RL) has successfully tackled numerous challenging tasks. However, while these models demonstrate enhanced decision-making abilities, they are increasingly prone to overfitting. For instance, a trained RL model often fails to generalize to even minor variations of the same task, such as a change in background color or other minor semantic differences. To address this issue, we propose a dual-agent adversarial policy learning framework, which allows agents to spontaneously learn the underlying semantics without introducing any human prior knowledge. Specifically, our framework involves a game process between two agents: each agent seeks to maximize the impact of perturbations on the opponent's policy by producing representation differences for the same state, while maintaining its own stability against such perturbations. This interaction encourages agents to learn generalizable policies, capable of handling irrelevant features from the high-dimensional observations. Extensive experimental results on the Procgen benchmark demonstrate that the adversarial process significantly improves the generalization performance of both agents and can be applied to various RL algorithms, e.g., Proximal Policy Optimization (PPO). With the adversarial framework, the RL agent outperforms the baseline methods by a significant margin, especially in hard-level tasks, marking a significant step forward in the generalization capabilities of deep reinforcement learning.
Chinese: 提出了一种双智能体对抗策略学习框架,通过智能体间的相互扰动学习潜在语义,显著提升了强化学习的泛化能力,在Procgen基准测试中明显优于基线方法。
English: A dual-agent adversarial policy learning framework is proposed to enhance the generalization of reinforcement learning by enabling agents to learn underlying semantics through mutual perturbation, significantly outperforming baselines on the Procgen benchmark.

Authors:Yanping Wu, Yanyong Huang, Zhengzhang Chen, Zijun Yao, Yanjie Fu, Kunpeng Liu, Xiao Luo, Dongjie Wang
Title: Iterative Feature Space Optimization through Incremental Adaptive Evaluation
Abstract:
Iterative feature space optimization involves systematically evaluating and adjusting the feature space to improve downstream task performance. However, existing works suffer from three key limitations: 1) overlooking differences among data samples leads to evaluation bias; 2) tailoring feature spaces to specific machine learning models results in overfitting and poor generalization; 3) requiring the evaluator to be retrained from scratch during each optimization iteration significantly reduces the overall efficiency of the optimization process. To bridge these gaps, we propose a gEneralized Adaptive feature Space Evaluator (EASE) to efficiently produce optimal and generalized feature spaces. This framework consists of two key components: Feature-Sample Subspace Generator and Contextual Attention Evaluator. The first component aims to decouple the information distribution within the feature space to mitigate evaluation bias. To achieve this, we first identify features most relevant to prediction tasks and samples most challenging for evaluation based on feedback from the subsequent evaluator. This decoupling strategy makes the evaluator consistently target the most challenging aspects of the feature space. The second component intends to incrementally capture evolving patterns of the feature space for efficient evaluation. We propose a weighted-sharing multi-head attention mechanism to encode key characteristics of the feature space into an embedding vector for evaluation. Moreover, the evaluator is updated incrementally, retaining prior evaluation knowledge while incorporating new insights, as consecutive feature spaces during the optimization process share partial information. Extensive experiments on fourteen real-world datasets demonstrate the effectiveness of the proposed framework. Our code and data are publicly available.
中文: 提出的EASE框架通过特征-样本信息解耦减少评估偏差,并采用加权共享注意力机制进行增量评估,有效解决了迭代特征空间优化中的泛化性和效率问题。
English: The proposed EASE framework addresses limitations in iterative feature space optimization by decoupling feature-sample information to reduce bias and using incremental evaluation with a weighted-sharing attention mechanism for efficiency and generalization.

Authors:Lei Huang, Xiaocheng Feng, Weitao Ma, Yuchun Fan, Xiachong Feng, Yangfan Ye, Weihong Zhong, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Bing Qin
Title: Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization
Abstract:
Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to explicitly discriminate between faithful and unfaithful generations. RHIO first augments unfaithful samples that simulate realistic model-intrinsic errors by selectively masking retrieval heads. Then, these samples are incorporated into joint training, enabling the model to distinguish unfaithful outputs from faithful ones conditioned on control tokens. Furthermore, these control tokens are leveraged to self-induce contrastive outputs, amplifying their difference through contrastive decoding. Additionally, to facilitate the evaluation of contextual faithfulness, we also introduce GroundBench, a comprehensive benchmark compiled from five existing LFQA datasets. Extensive experimental results on GroundBench demonstrate that RHIO significantly improves faithfulness, even outperforming GPT-4o.
Chinese: 本研究提出RHIO框架,通过利用检索头生成不忠实样本进行联合训练,并结合控制令牌与对比解码,显著提升了检索增强大语言模型在长文本问答中的上下文忠实度,在新建的GroundBench基准测试中表现优异,甚至超越了GPT-4o。
English: This study introduces RHIO, a framework that enhances contextual faithfulness in retrieval-augmented LLMs for long-form question-answering by leveraging retrieval heads to generate unfaithful samples for joint training and employing control tokens with contrastive decoding, validated on the new GroundBench benchmark where it surpasses GPT-4o.
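
The contrastive decoding step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give RHIO's exact scoring rule, so the form below (amplifying the gap between logits conditioned on a "faithful" control token and logits conditioned on an "unfaithful" one) is an assumption borrowed from common contrastive decoding variants. The function names and the `alpha` parameter are hypothetical.

```python
def contrastive_scores(faithful_logits, unfaithful_logits, alpha=1.0):
    """Assumed contrast rule: (1 + alpha) * faithful - alpha * unfaithful."""
    return [
        (1 + alpha) * f - alpha * u
        for f, u in zip(faithful_logits, unfaithful_logits)
    ]


def greedy_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Toy vocabulary of three tokens. Token 0 is equally likely under both
# conditions; token 1 is much more likely under the faithful condition.
faithful = [2.0, 1.9, 0.0]
unfaithful = [2.0, 0.5, 0.0]

plain_choice = greedy_token(faithful)
contrastive_choice = greedy_token(contrastive_scores(faithful, unfaithful))
```

In this toy case plain greedy decoding picks token 0, while the contrasted scores favor token 1, the token whose likelihood depends most on the faithful condition.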

Authors:Zhenliang Ni, Qiangyu Yan, Mouxiao Huang, Tianning Yuan, Yehui Tang, Hailin Hu, Xinghao Chen, Yunhe Wang
Title: GenVidBench: A Challenging Benchmark for Detecting AI-Generated Video
Abstract:
The rapid advancement of video generation models has made it increasingly challenging to distinguish AI-generated videos from real ones. This issue underscores the urgent need for effective AI-generated video detectors to prevent the dissemination of false information through such videos. However, the development of high-performance generative video detectors is currently impeded by the lack of large-scale, high-quality datasets specifically designed for generative video detection. To this end, we introduce GenVidBench, a challenging AI-generated video detection dataset with several key advantages: 1) Cross Source and Cross Generator: The cross-generation source mitigates the interference of video content on the detection. The cross-generator ensures diversity in video attributes between the training and test sets, preventing them from being overly similar. 2) State-of-the-Art Video Generators: The dataset includes videos from 8 state-of-the-art AI video generators, ensuring that it covers the latest advancements in the field of video generation. 3) Rich Semantics: The videos in GenVidBench are analyzed from multiple dimensions and classified into various semantic categories based on their content. This classification ensures that the dataset is not only large but also diverse, aiding in the development of more generalized and effective detection models. We conduct a comprehensive evaluation of different advanced video generators and present a challenging setting. Additionally, we present rich experimental results including advanced video classification models as baselines. With the GenVidBench, researchers can efficiently develop and evaluate AI-generated video detection models. Datasets and code are available at https://genvidbench.github.io.
中文: GenVidBench数据集的推出解决了AI生成视频检测领域缺乏专业数据集的问题,它通过整合多种先进生成器的高质量视频,为开发有效的检测模型提供了重要支持。
English: The introduction of GenVidBench addresses the critical shortage of specialized datasets for AI-generated video detection by providing a diverse, large-scale collection from advanced generators, enabling effective model development and evaluation.

Authors:Erle Zhu, Yadi Liu, Zhe Zhang, Xujun Li, Jin Zhou, Xinjie Yu, Minlie Huang, Hongning Wang
Title: MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science
Abstract:
Pre-trained on extensive text and image corpora, current Multi-Modal Large Language Models (MLLM) have shown strong capabilities in general visual reasoning tasks. However, their performance is still lacking in physical domains that require understanding diagrams with complex physical structures and quantitative analysis based on multi-modal information. To address this, we develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS) based on an MLLM. MAPS decomposes an expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator. The PPM module is obtained by fine-tuning a visual language model using carefully designed synthetic data with paired physical diagrams and corresponding simulation language descriptions. At the inference stage, MAPS integrates the simulation language description of the input diagram provided by PPM and results obtained through a Chain-of-Simulation process with MLLM to derive the underlying rationale and the final answer. Validated using our collected college-level circuit analysis problems, MAPS significantly improves reasoning accuracy of MLLM and outperforms all existing models. The results confirm MAPS offers a promising direction for enhancing multi-modal scientific reasoning ability of MLLMs. We will release our code, model and dataset used for our experiments upon publication of this paper.
中文: 当前多模态大语言模型在通用视觉推理中表现优异,但在需要理解复杂物理结构和多模态定量分析的领域仍有不足;为此开发的MAPS框架通过物理感知与模拟集成,显著提升了科学推理准确性,并在电路分析问题中得到验证。
English: Current Multi-Modal Large Language Models (MLLMs) excel in general visual reasoning but struggle with physical domains, prompting the development of the MAPS framework, which integrates physical perception and simulation to significantly enhance scientific reasoning accuracy, as validated by circuit analysis problems.

Authors:Dongjie Wang, Yanyong Huang, Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Tao Zhe, Kunpeng Liu, Meng Xiao, Pengfei Wang, Pengyang Wang, Hui Xiong, Yanjie Fu
Title: Towards Data-Centric AI: A Comprehensive Survey of Traditional, Reinforcement, and Generative Approaches for Tabular Data Transformation
Abstract:
Tabular data is one of the most widely used formats across industries, driving critical applications in areas such as finance, healthcare, and marketing. In the era of data-centric AI, improving data quality and representation has become essential for enhancing model performance, particularly in applications centered around tabular data. This survey examines the key aspects of tabular data-centric AI, emphasizing feature selection and feature generation as essential techniques for data space refinement. We provide a systematic review of feature selection methods, which identify and retain the most relevant data attributes, and feature generation approaches, which create new features to simplify the capture of complex data patterns. This survey offers a comprehensive overview of current methodologies through an analysis of recent advancements, practical applications, and the strengths and limitations of these techniques. Finally, we outline open challenges and suggest future perspectives to inspire continued innovation in this field.
中文: 本综述全面探讨了以表格数据为中心的人工智能,重点介绍了特征选择和特征生成技术以提升数据质量和模型性能,同时分析了现有方法、应用及未来挑战。
English: This survey provides a comprehensive overview of tabular data-centric AI, focusing on feature selection and generation techniques to enhance data quality and model performance, while also discussing current methodologies, applications, and future challenges.

Authors:Wenqi Fan, Yi Zhou, Shijie Wang, Yuyao Yan, Hui Liu, Qian Zhao, Le Song, Qing Li
Title: Computational Protein Science in the Era of Large Language Models (LLMs)
Abstract:
Considering the significance of proteins, computational protein science has always been a critical scientific field, dedicated to revealing knowledge and developing applications within the protein sequence-structure-function paradigm. In the last few decades, Artificial Intelligence (AI) has made significant impacts in computational protein science, leading to notable successes in specific protein modeling tasks. However, those previous AI models still meet limitations, such as the difficulty in comprehending the semantics of protein sequences, and the inability to generalize across a wide range of protein modeling tasks. Recently, LLMs have emerged as a milestone in AI due to their unprecedented language processing & generalization capability. They can promote comprehensive progress in fields rather than solving individual tasks. As a result, researchers have actively introduced LLM techniques in computational protein science, developing protein Language Models (pLMs) that skillfully grasp the foundational knowledge of proteins and can be effectively generalized to solve a diversity of sequence-structure-function reasoning problems. While witnessing prosperous developments, it's necessary to present a systematic overview of computational protein science empowered by LLM techniques. First, we summarize existing pLMs into categories based on their mastered protein knowledge, i.e., underlying sequence patterns, explicit structural and functional information, and external scientific languages. Second, we introduce the utilization and adaptation of pLMs, highlighting their remarkable achievements in promoting protein structure prediction, protein function prediction, and protein design studies. Then, we describe the practical application of pLMs in antibody design, enzyme design, and drug discovery. Finally, we specifically discuss the promising future directions in this fast-growing field.
中文: 人工智能,特别是大语言模型,正通过蛋白质语言模型掌握蛋白质知识并泛化解决多种序列-结构-功能问题,从而革新计算蛋白质科学,本文系统综述了其分类、应用及未来方向。
English: Artificial Intelligence, particularly large language models (LLMs), is revolutionizing computational protein science by enabling protein language models (pLMs) to master protein knowledge and generalize across diverse sequence-structure-function tasks, prompting a systematic review of their categorization, applications, and future directions.

Authors:Dareen Alharthi, Mahsa Zamani, Bhiksha Raj, Rita Singh
Title: Tessellated Linear Model for Age Prediction from Voice
Abstract:
Voice biometric tasks, such as age estimation, require modeling the often complex relationship between voice features and the biometric variable. While deep learning models can handle such complexity, they typically require large amounts of accurately labeled data to perform well. Such data are often scarce for biometric tasks such as voice-based age prediction. On the other hand, simpler models like linear regression can work with smaller datasets but often fail to generalize to the underlying non-linear patterns present in the data. In this paper we propose the Tessellated Linear Model (TLM), a piecewise linear approach that combines the simplicity of linear models with the capacity of non-linear functions. TLM tessellates the feature space into convex regions and fits a linear model within each region. We optimize the tessellation and the linear models using a hierarchical greedy partitioning. We evaluated TLM on the TIMIT dataset on the task of age prediction from voice, where it outperformed state-of-the-art deep learning models.
Chinese: 镶嵌线性模型(TLM)通过将特征空间分割为凸区域并在每个区域内拟合线性模型,结合了线性模型的简洁性和非线性函数的能力,在TIMIT数据集上的语音年龄预测任务中表现优于先进的深度学习模型。
English: The Tessellated Linear Model (TLM) is a piecewise linear approach that combines the simplicity of linear models with the capacity for non-linearity by tessellating the feature space and fitting linear models within each region, outperforming deep learning models in voice-based age prediction on the TIMIT dataset.
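
The idea of partitioning the feature space and fitting one linear model per region can be sketched in a few lines. This is a toy illustration under strong assumptions, not the authors' TLM: it uses a single scalar feature, threshold splits (so each region is an interval, which is trivially convex), and a squared-error criterion for the greedy split. All function names are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b on one region."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # degenerate region: predict the mean
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def sse(xs, ys, model):
    """Sum of squared residuals of a linear model on a region."""
    a, b = model
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def build(xs, ys, depth=2, min_size=3):
    """Greedily split a region at the threshold that most reduces error."""
    model = fit_line(xs, ys)
    best = None
    if depth > 0:
        for t in sorted(set(xs))[1:]:
            left = [(x, y) for x, y in zip(xs, ys) if x < t]
            right = [(x, y) for x, y in zip(xs, ys) if x >= t]
            if len(left) < min_size or len(right) < min_size:
                continue
            lx, ly = zip(*left)
            rx, ry = zip(*right)
            err = sse(lx, ly, fit_line(lx, ly)) + sse(rx, ry, fit_line(rx, ry))
            if best is None or err < best[0]:
                best = (err, t, (lx, ly), (rx, ry))
    # Keep a single line if no valid split improves the fit.
    if best is None or best[0] >= sse(xs, ys, model):
        return {"model": model}
    _, t, (lx, ly), (rx, ry) = best
    return {
        "threshold": t,
        "left": build(lx, ly, depth - 1, min_size),
        "right": build(rx, ry, depth - 1, min_size),
    }


def predict(tree, x):
    """Route x to its region, then apply that region's linear model."""
    while "model" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    a, b = tree["model"]
    return a * x + b


# A target no single line can fit: y = |x|.
xs = list(range(-5, 6))
ys = [abs(x) for x in xs]
tree = build(xs, ys, depth=1)
```

With one split, the two regional lines recover both arms of `y = |x|`, whereas a single global line would have slope zero on this symmetric data.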

Authors:Haoyu Han, Yaochen Xie, Hui Liu, Xianfeng Tang, Sreyashi Nag, William Headden, Hui Liu, Yang Li, Chen Luo, Shuiwang Ji, Qi He, Jiliang Tang
Title: Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning
Abstract:
Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks; however, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between distinct pieces of information within text sequences. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop question answering, where understanding implicit relationships between entities and leveraging multi-hop connections in the given context are crucial. Graphs, as fundamental data structures, explicitly represent pairwise relationships between entities, thereby offering the potential to enhance LLMs' reasoning capabilities. External graphs have proven effective in supporting LLMs across multiple tasks. However, in many reasoning tasks, no pre-existing graph structure is provided. Can we structure implicit knowledge derived from context into graphs to assist LLMs in reasoning? In this paper, we propose Reasoning with Graphs (RwG) by first constructing explicit graphs from the context and then leveraging these graphs to enhance LLM reasoning performance on reasoning tasks. Extensive experiments demonstrate the effectiveness of the proposed method in improving both logical reasoning and multi-hop question answering tasks.
Chinese: 大语言模型在复杂推理任务中存在困难,但本文提出的“图推理”方法通过从上下文中构建显式图结构,有效提升了逻辑推理和多跳问答的性能。
English: Large language models face difficulties in complex reasoning tasks, but the proposed Reasoning with Graphs method enhances their performance by constructing explicit graphs from context to improve logical reasoning and multi-hop question answering.
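
The two stages described in the abstract (build an explicit graph from the context, then use it for multi-hop reasoning) can be sketched as follows. The sketch assumes the context has already been reduced to (head, relation, tail) triples; how RwG extracts and verifies the graph is not described in the abstract, and the function names here are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Adjacency list keyed by head entity: head -> [(relation, tail), ...]."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of hops or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None


# Hypothetical triples standing in for relations stated in a passage.
triples = [
    ("Alice", "mother_of", "Bob"),
    ("Bob", "father_of", "Carol"),
]
graph = build_graph(triples)
hops = find_path(graph, "Alice", "Carol")
```

The returned chain of hops makes the implicit two-step relation between the two entities explicit, and such a chain could then be serialized into the prompt given to the LLM.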

Authors:Zhengpeng Xie, Jiahang Cao, Changwei Wang, Fan Yang, Marco Hutter, Qiang Zhang, Jianxiong Zhang, Renjing Xu
Title: Representation Convergence: Mutual Distillation is Secretly a Form of Regularization
Abstract:
In this paper, we argue that mutual distillation between reinforcement learning policies serves as an implicit regularization, preventing them from overfitting to irrelevant features. We highlight two separate contributions: (i) Theoretically, for the first time, we prove that enhancing the policy robustness to irrelevant features leads to improved generalization performance. (ii) Empirically, we demonstrate that mutual distillation between policies contributes to such robustness, enabling the spontaneous emergence of invariant representations over pixel inputs. Ultimately, we do not claim to achieve state-of-the-art performance but rather focus on uncovering the underlying principles of generalization and deepening our understanding of its mechanisms.
中文: 强化学习策略间的相互蒸馏可作为隐式正则化,防止对无关特征的过拟合,理论与实证均表明该方法通过不变性表征提升了鲁棒性与泛化能力。
English: Mutual distillation between reinforcement learning policies acts as implicit regularization by preventing overfitting to irrelevant features, theoretically proving and empirically demonstrating that this enhances robustness and generalization through invariant representations.
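
A minimal sketch of a mutual distillation term between two policies follows. The abstract does not state the loss, so this assumes a common formulation: each policy is penalized by the KL divergence from its peer's action distribution on the same state. The function names and the direction of each KL term are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over action logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_distillation_losses(logits_a, logits_b):
    """Return (loss for policy A, loss for policy B) on one state.

    Assumed form: each policy treats its peer as a fixed teacher, so
    A minimizes KL(pi_B || pi_A) and B minimizes KL(pi_A || pi_B).
    In training, each term would be added to that policy's RL objective.
    """
    pa, pb = softmax(logits_a), softmax(logits_b)
    return kl(pb, pa), kl(pa, pb)


loss_a, loss_b = mutual_distillation_losses([2.0, 0.0, -1.0], [0.5, 0.5, 0.0])
same_a, same_b = mutual_distillation_losses([1.0, 2.0], [1.0, 2.0])
```

Both terms vanish when the two policies agree and grow as their action distributions diverge, which is the sense in which the peer acts as a regularizer.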

Authors:Jingkai Sun, Qiang Zhang, Jiaxu Wang, Jiahang Cao, Renjing Xu
Title: Event Masked Autoencoder: Point-wise Action Recognition with Event-Based Cameras
Abstract:
Dynamic vision sensors (DVS) are bio-inspired devices that capture visual information in the form of asynchronous events, which encode changes in pixel intensity with high temporal resolution and low latency. These events provide rich motion cues that can be exploited for various computer vision tasks, such as action recognition. However, most existing DVS-based action recognition methods lose temporal information during data transformation or suffer from noise and outliers caused by sensor imperfections or environmental factors. To address these challenges, we propose a novel framework that preserves and exploits the spatiotemporal structure of event data for action recognition. Our framework consists of two main components: 1) a point-wise event masked autoencoder (MAE) that learns a compact and discriminative representation of event patches by reconstructing them from masked raw event camera points data; 2) an improved event points patch generation algorithm that leverages an event data inlier model and point-wise data augmentation techniques to enhance the quality and diversity of event points patches. To the best of our knowledge, our approach introduces the pre-train method into event camera raw points data for the first time, and we propose a novel event points patch embedding to utilize transformer-based models on event cameras.
中文摘要:该框架通过引入逐点事件掩码自编码器和改进的事件点块生成算法,解决了动态视觉传感器动作识别中的时空信息丢失和噪声问题,首次将预训练方法应用于事件相机原始数据。
English Summary: The proposed framework addresses challenges in DVS-based action recognition by introducing a point-wise event masked autoencoder and improved patch generation algorithm to preserve spatiotemporal information while reducing noise.

Authors:Ting-Yao E. Hsu, Yi-Li Hsu, Shaurya Rohatgi, Chieh-Yang Huang, Ho Yin Sam Ng, Ryan Rossi, Sungchul Kim, Tong Yu, Lun-Wei Ku, C. Lee Giles, Ting-Hao K. Huang
Title: Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
Abstract:
Since the SciCap dataset's launch in 2021, the research community has made significant progress in generating captions for scientific figures in scholarly articles. In 2023, the first SciCap Challenge took place, inviting global teams to use an expanded SciCap dataset to develop models for captioning diverse figure types across various academic fields. At the same time, text generation models advanced quickly, with many powerful pre-trained large multimodal models (LMMs) emerging that showed impressive capabilities in various vision-and-language tasks. This paper presents an overview of the first SciCap Challenge and details the performance of various models on its data, capturing a snapshot of the field's state. We found that professional editors overwhelmingly preferred figure captions generated by GPT-4V over those from all other models and even the original captions written by authors. Following this key finding, we conducted detailed analyses to answer this question: Have advanced LMMs solved the task of generating captions for scientific figures?
中文:自2021年SciCap数据集发布以来,科学图表标注研究取得显著进展,2023年首届SciCap挑战赛结果显示,GPT-4V生成的图表说明在专业编辑评选中全面优于其他模型乃至作者原稿,引发对先进大模型是否已完全解决科学图表标注任务的深入探讨。
English: Since the SciCap datasets' introduction in 2021, significant advancements have been made in scientific figure captioning, culminating in the 2023 SciCap Challenge where GPT-4V's generated captions were overwhelmingly preferred by professional editors over other models and even original author-written captions, prompting detailed analysis on whether advanced large multimodal models have fully solved this task.

Authors:Tierui Gong, Chau Yuen, Chong Meng Samson See, Mérouane Debbah, Lajos Hanzo
Title: Rydberg Atomic Quantum Receivers for the Multi-User MIMO Uplink
Abstract:
Rydberg atomic quantum receivers exhibit great potential in assisting classical wireless communications due to their outstanding advantages in detecting radio frequency signals. To realize this potential, we integrate a Rydberg atomic quantum receiver into a classical multi-user multiple-input multiple-output (MIMO) scheme to form a multi-user Rydberg atomic quantum MIMO (RAQ-MIMO) system for the uplink. To study this system, we first construct an equivalent baseband signal model, which facilitates convenient system design, signal processing and optimizations. We then study the ergodic achievable rates under both the maximum ratio combining (MRC) and zero-forcing (ZF) schemes by deriving their tight lower bounds. We next compare the ergodic achievable rates of the RAQ-MIMO and the conventional massive MIMO schemes by offering a closed-form expression for the difference of their ergodic achievable rates, which allows us to directly compare the two systems. Our results show that RAQ-MIMO allows the average transmit power of users to be $> 25$ dBm lower than that of the conventional massive MIMO. Viewed from a different perspective, an extra $\sim 8.8$ bits/s/Hz/user rate becomes achievable by ZF RAQ-MIMO.
中文摘要:Rydberg原子量子MIMO(RAQ-MIMO)系统相比传统大规模MIMO展现出显著优势,可实现用户发射功率降低超过25 dBm,或通过迫零处理获得约8.8 bits/s/Hz/用户的额外速率提升。
English Summary: The Rydberg atomic quantum MIMO (RAQ-MIMO) system demonstrates superior performance over conventional massive MIMO, enabling over 25 dBm lower user transmit power or achieving approximately 8.8 bits/s/Hz/user higher rates with zero-forcing processing.
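
The two headline figures can be related by a back-of-the-envelope conversion. The sketch below assumes a high-SNR regime in which the per-user rate grows as log2 of the SNR, so that a power advantage of X dB corresponds to roughly log2(10^(X/10)) extra bits/s/Hz. This approximation is not taken from the paper, whose closed-form expression may differ; the function names are hypothetical.

```python
import math


def db_to_ratio(db):
    """Convert a decibel difference to a linear power ratio."""
    return 10 ** (db / 10)


def high_snr_rate_gain(db):
    """Extra bits/s/Hz from a dB power advantage, assuming rate ~ log2(SNR)."""
    return math.log2(db_to_ratio(db))


def rate_gain_to_db(bits):
    """Inverse mapping: dB advantage implied by a given rate gain."""
    return 10 * math.log10(2 ** bits)


ratio_25db = db_to_ratio(25)            # about 316x in linear power
gain_25db = high_snr_rate_gain(25)      # about 8.3 bits/s/Hz
db_for_8_8_bits = rate_gain_to_db(8.8)  # about 26.5 dB
```

Under this approximation, an extra 8.8 bits/s/Hz corresponds to a power advantage of about 26.5 dB, which is in line with the reported reduction of more than 25 dB.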

Authors:Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che
Title: DReSS: Data-driven Regularized Structured Streamlining for Large Language Models
Abstract:
Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the potential to reduce model size through pruning techniques. However, existing pruning methods typically follow a prune-then-finetune paradigm. Since the pruned components still contain valuable information, their direct removal often leads to irreversible performance degradation, imposing a substantial computational burden to recover performance during finetuning. In this paper, we propose a novel paradigm that first applies regularization, then prunes, and finally finetunes. Based on this paradigm, we introduce DReSS, a simple and effective Data-driven Regularized Structured Streamlining method for LLMs. By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance. Compared to direct pruning, this can reduce the information loss caused by parameter removal, thereby enhancing its language modeling capabilities. Experimental results demonstrate that DReSS significantly outperforms existing pruning methods even under extreme pruning ratios, significantly reducing latency and increasing throughput.
Chinese: 大型语言模型可通过剪枝提升效率,但现有方法会导致性能损失,因此新方法DReSS采用数据驱动的正则化技术,在极端剪枝条件下仍能保留信息并增强性能。
English: Large language models can be made more efficient through pruning, but existing methods cause performance loss, so a new approach called DReSS uses data-driven regularization to preserve information and enhance performance even under extreme pruning.

Authors:Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu, Guanjun Li
Title: MTPareto: A MultiModal Targeted Pareto Framework for Fake News Detection
Abstract:
Multimodal fake news detection is essential for maintaining the authenticity of Internet multimedia information. Significant differences in form and content of multimodal information lead to intensified optimization conflicts, hindering effective model training as well as reducing the effectiveness of existing bimodal fusion methods. To address this problem, we propose the MTPareto framework to optimize multimodal fusion, using a Targeted Pareto (TPareto) optimization algorithm for fusion-level-specific objective learning with a certain focus. Based on the designed hierarchical fusion network, the algorithm defines three fusion levels with corresponding losses and implements all-modal-oriented Pareto gradient integration for each. This approach accomplishes superior multimodal fusion by utilizing the information obtained from intermediate fusion to provide positive effects to the entire process. Experimental results on the FakeSV and FVC datasets show that the proposed framework outperforms baselines and the TPareto optimization algorithm achieves 2.40% and 1.89% accuracy improvement, respectively.
中文: MTPareto框架通过针对性帕累托优化算法提升多模态融合效果,在假新闻检测任务中显著优于基线方法并在标准数据集上实现了准确率提升。
English: The MTPareto framework addresses multimodal fake news detection by employing a Targeted Pareto optimization algorithm to enhance fusion effectiveness, achieving notable accuracy improvements on benchmark datasets.
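
To give a sense of what "Pareto gradient integration" can mean, the sketch below implements the standard two-objective min-norm rule from multi-objective optimization (as used in MGDA-style methods): it picks the convex combination of two gradients with the smallest norm. This is a generic illustration and not the TPareto algorithm, whose details the abstract does not give; the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_combination(g1, g2):
    """Return (alpha, d) with d = alpha * g1 + (1 - alpha) * g2 of minimum norm.

    Closed form for two gradients:
    alpha = clip(((g2 - g1) . g2) / ||g1 - g2||^2, 0, 1).
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0:
        return 1.0, list(g1)  # identical gradients: nothing to balance
    alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    alpha = max(0.0, min(1.0, alpha))
    d = [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, d


# Two conflicting loss gradients, e.g. from two fusion-level objectives.
alpha, d = min_norm_combination([1.0, 0.0], [0.0, 1.0])
# One gradient dominates in magnitude: the rule falls back to the smaller one.
alpha_dom, d_dom = min_norm_combination([1.0, 0.0], [3.0, 0.0])
```

When the combined vector is nonzero, it has a non-negative inner product with both gradients, so a small step against it does not increase either loss to first order.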

Authors:Ho Yin Ng, Ting-Yao Hsu, Jiyoo Min, Sungchul Kim, Ryan A. Rossi, Tong Yu, Hyunggu Jung, Ting-Hao 'Kenneth' Huang
Title: Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing
Abstract:
Figures and their captions play a key role in scientific publications. However, despite their importance, many captions in published papers are poorly crafted, largely due to a lack of attention by paper authors. While prior AI research has explored caption generation, it has mainly focused on reader-centered use cases, where users evaluate generated captions rather than actively integrating them into their writing. This paper addresses this gap by investigating how paper authors incorporate AI-generated captions into their writing process through a user study involving 18 participants. Each participant rewrote captions for two figures from their own recently published work, using captions generated by state-of-the-art AI models as a resource. By analyzing video recordings of the writing process through interaction analysis, we observed that participants often began by copying and refining AI-generated captions. Paper writers favored longer, detail-rich captions that integrated textual and visual elements but found current AI models less effective for complex figures. These findings highlight the nuanced and diverse nature of figure caption composition, revealing design opportunities for AI systems to better support the challenges of academic writing.
中文: 本研究探讨作者如何将AI生成的图注融入写作过程,发现他们常通过复制和精炼来丰富细节,但现有AI对复杂图表处理不足,这为优化学术写作辅助系统指明了设计方向。
English: This study explores how authors integrate AI-generated captions into their writing process, revealing that while they often refine these captions for richer detail, current AI struggles with complex figures, pointing to future design improvements for academic support.

Authors:Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Title: Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction
Abstract:
Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly affecting people with neurological disorders. Text or speech decoding from brain activities is a relevant domain that could augment the quality of life for people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: EEG module, speech module, and phoneme predictor. The EEG module learns to properly represent EEG signals into EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in text modality. Our proposed approach allows users to obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a concatenated sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.
中文: 脑机接口通过一种新颖模型,利用辅助音素预测器同时从脑电信号解码出语音波形和文本音素序列,从而提升了听语音解码的性能,优于以往方法。
English: Brain-computer interfaces enhance listened speech decoding from EEG signals by using a novel model that simultaneously outputs speech waveforms and textual phoneme sequences, improving performance over previous methods.

Authors:Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
Title: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Abstract:
This work presents Sa2VA, the first unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with LLaVA, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV datasets to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves state-of-the-art across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications.
中文: Sa2VA是首个统一理解图像和视频密集基础信息的模型,结合SAM-2和LLaVA,通过少量指令调整即可支持多种任务,并在指代视频对象分割等复杂应用中达到领先性能。
English: Sa2VA is the first unified model for dense grounded understanding of images and videos, integrating SAM-2 and LLaVA to support various tasks with minimal instruction tuning and achieving state-of-the-art performance, especially in referring video object segmentation.

Authors:Tierui Gong, Chau Yuen, Chong Meng Samson See, Mérouane Debbah, Lajos Hanzo
Title: Rydberg Atomic Quantum Receivers for Multi-Target DOA Estimation
Abstract:
Quantum sensing technologies have experienced rapid progress since entering the 'second quantum revolution'. Among various candidates, schemes relying on Rydberg atoms exhibit compelling advantages for detecting radio frequency signals. Based on this, Rydberg atomic quantum receivers (RAQRs) have emerged as a promising solution to classical wireless communication and sensing. To harness the advantages and exploit the potential of RAQRs in wireless sensing, we investigate the realization of the direction of arrival (DOA) estimation by RAQRs. Specifically, we first conceive a Rydberg atomic quantum uniform linear array (RAQ-ULA) aided wireless receiver for multi-target DOA detection and propose the corresponding signal model of this sensing system. Our model reveals that the presence of the radio-frequency local oscillator in the RAQ-ULA creates sensor gain mismatches, which significantly degrade DOA estimation when the classical Estimation of Signal Parameters via Rotational Invariant Techniques (ESPRIT) is employed. To solve this sensor gain mismatch problem, we propose the Rydberg atomic quantum ESPRIT (RAQ-ESPRIT) relying on our model. Lastly, we characterize our scheme through numerical simulations, whose results show that it reduces the estimation error of its classical counterpart by more than $400$-fold and $9000$-fold in the PSL and SQL, respectively.
中文摘要:本研究提出了一种基于里德堡原子量子接收器的多目标波达方向估计系统,并开发了RAQ-ESPRIT算法解决传感器增益失配问题,在不同场景下相比传统方法将估计误差降低了超过400倍和9000倍。
English Summary: This study introduces a Rydberg atomic quantum receiver system for multi-target direction-of-arrival estimation and proposes a novel RAQ-ESPRIT algorithm that addresses sensor gain mismatch issues, achieving over 400-fold and 9000-fold error reduction compared to classical methods in different scenarios.

Authors:Zhaoliang Wan, Yonggen Ling, Senlin Yi, Lu Qi, Wangwei Lee, Minglei Lu, Sicheng Yang, Xiao Teng, Peng Lu, Xu Yang, Ming-Hsuan Yang, Hui Cheng
Title: VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception
Abstract:
This paper addresses the scarcity of large-scale datasets for accurate object-in-hand pose estimation, which is crucial for robotic in-hand manipulation within the "Perception-Planning-Control" paradigm. Specifically, we introduce VinT-6D, the first extensive multi-modal dataset integrating vision, touch, and proprioception, to enhance robotic manipulation. VinT-6D comprises 2 million VinT-Sim and 0.1 million VinT-Real splits, collected via simulations in MuJoCo and Blender and a custom-designed real-world platform. This dataset is tailored for robotic hands, offering models with whole-hand tactile perception and high-quality, well-aligned data. To the best of our knowledge, VinT-Real is the largest of its kind, considering the difficulty of collecting data in real-world environments, and it helps bridge the simulation-to-real gap compared with previous works. Built upon VinT-6D, we present a benchmark method that shows significant improvements in performance by fusing multi-modal information. The project is available at https://VinT-6D.github.io/.
中文: 本文提出了首个融合视觉、触觉和本体感知的大规模多模态数据集VinT-6D,解决了机器人手内操作中物体姿态估计数据稀缺的问题,并基于该数据集开发了通过多模态融合显著提升性能的基准方法。
English: This paper introduces VinT-6D, the first large-scale multi-modal dataset combining vision, touch, and proprioception to address the scarcity of data for object-in-hand pose estimation in robotic manipulation, along with a benchmark method that demonstrates significant performance improvements through multi-modal fusion.

Authors:Yongchao Wang, Junjie Wang, Xiaobin Zhou, Tiankai Yang, Chao Xu, Fei Gao
Title: Safe and Agile Transportation of Cable-Suspended Payload via Multiple Aerial Robots
Abstract:
Transporting a heavy payload using multiple aerial robots (MARs) is an efficient manner to extend the load capacity of a single aerial robot. However, existing schemes for the multiple aerial robots transportation system (MARTS) still lack the capability to generate a collision-free and dynamically feasible trajectory in real-time and further track an agile trajectory especially when there are no sensors available to measure the states of payload and cable. Therefore, they are limited to low-agility transportation in simple environments. To bridge the gap, we propose complete planning and control schemes for the MARTS, achieving safe and agile aerial transportation (SAAT) of a cable-suspended payload in complex environments. Flatness maps for the aerial robot considering the complete kinematical constraint and the dynamical coupling between each aerial robot and payload are derived. To improve the responsiveness for the generation of the safe, dynamically feasible, and agile trajectory in complex environments, a real-time spatio-temporal trajectory planning scheme is proposed for the MARTS. Besides, we break away from the reliance on the state measurement for both the payload and cable, as well as the closed-loop control for the payload, and propose a fully distributed control scheme to track the agile trajectory that is robust against imprecise payload mass and non-point mass payload. The proposed schemes are extensively validated through benchmark comparisons, ablation studies, and simulations. Finally, extensive real-world experiments are conducted on a MARTS integrated by three aerial robots with onboard computers and sensors. The result validates the efficiency and robustness of our proposed schemes for SAAT in complex environments.
中文摘要:本研究提出了一套完整的多飞行器规划与控制方案,实现了复杂环境下线缆悬挂负载的安全敏捷运输,其特点包括实时轨迹生成和无需负载状态测量的分布式控制。
English Summary: This study introduces a complete planning and control framework for multiple aerial robots to safely and agilely transport cable-suspended payloads in complex environments, featuring real-time trajectory generation and distributed control without relying on payload state measurements.

Authors:Zili Liu, Hao Chen, Lei Bai, Wenyuan Li, Zhengxia Zou, Zhenwei Shi
Title: Kolmogorov Arnold Neural Interpolator for Downscaling and Correcting Meteorological Fields from In-Situ Observations
Abstract:
Obtaining accurate weather forecasts at station locations is a critical challenge due to systematic biases arising from the mismatch between multi-scale, continuous atmospheric characteristics and their discrete, gridded representations. Previous works have primarily focused on modeling gridded meteorological data, inherently neglecting the off-grid, continuous nature of atmospheric states and leaving such biases unresolved. To address this, we propose the Kolmogorov Arnold Neural Interpolator (KANI), a novel framework that redefines meteorological field representation as continuous neural functions derived from discretized grids. Grounded in the Kolmogorov Arnold theorem, KANI captures the inherent continuity of atmospheric states and leverages sparse in-situ observations to correct these biases systematically. Furthermore, KANI introduces an innovative zero-shot downscaling capability, guided by high-resolution topographic textures without requiring high-resolution meteorological fields for supervision. Experimental results across three sub-regions of the continental United States indicate that KANI achieves an accuracy improvement of 40.28% for temperature and 67.41% for wind speed, highlighting its significant improvement over traditional interpolation methods. This enables continuous neural representation of meteorological variables through neural networks, transcending the limitations of conventional grid-based representations.
中文摘要:Kolmogorov Arnold神经插值器(KANI)通过构建连续神经函数框架,利用稀疏观测数据和高分辨率地形特征,系统修正气象预报中的网格偏差,实现了温度和风速预测精度超过40%和67%的显著提升。
English Summary: The Kolmogorov Arnold Neural Interpolator (KANI) introduces a continuous neural framework that corrects systematic biases in weather forecasting by leveraging sparse observations and topographic data, achieving accuracy improvements of 40.28% for temperature and 67.41% for wind speed over traditional interpolation methods.

Authors:Mingqi Yuan, Bo Li, Xin Jin, Wenjun Zeng
Title: Deep Reinforcement Learning with Hybrid Intrinsic Reward Model
Abstract:
Intrinsic reward shaping has emerged as a prevalent approach to solving hard-exploration and sparse-rewards environments in reinforcement learning (RL). While single intrinsic rewards, such as curiosity-driven or novelty-based methods, have shown effectiveness, they often limit the diversity and efficiency of exploration. Moreover, the potential and principle of combining multiple intrinsic rewards remains insufficiently explored. To address this gap, we introduce HIRE (Hybrid Intrinsic REward), a flexible and elegant framework for creating hybrid intrinsic rewards through deliberate fusion strategies. With HIRE, we conduct a systematic analysis of the application of hybrid intrinsic rewards in both general and unsupervised RL across multiple benchmarks. Extensive experiments demonstrate that HIRE can significantly enhance exploration efficiency and diversity, as well as skill acquisition in complex and dynamic settings.
中文: HIRE框架通过融合多种内在奖励,在强化学习中显著提升了探索效率、多样性和复杂环境下的技能获取能力。
English: HIRE is a flexible framework that combines multiple intrinsic rewards to significantly improve exploration efficiency, diversity, and skill acquisition in reinforcement learning across various benchmarks.
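
As a rough illustration of what fusing several intrinsic rewards can look like, the sketch below combines per-sample reward signals with a few simple rules. The abstract does not name HIRE's fusion strategies, so the function `fuse_intrinsic_rewards`, the strategy names (`sum`, `product`, `max`, `cycle`), and the reward names in the usage lines are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def fuse_intrinsic_rewards(rewards, strategy="sum", step=0):
    """Combine several intrinsic reward signals into one (hypothetical helper).

    rewards:  dict mapping a reward name to an array of shape (batch,)
    strategy: one of "sum", "product", "max", "cycle" (assumed names)
    step:     training step, used only by "cycle" to pick one signal in turn
    """
    # Stack into shape (num_rewards, batch); dict order defines the cycle order.
    stacked = np.stack([np.asarray(r, dtype=float) for r in rewards.values()])
    if strategy == "sum":
        return stacked.sum(axis=0)
    if strategy == "product":
        return stacked.prod(axis=0)
    if strategy == "max":
        return stacked.max(axis=0)
    if strategy == "cycle":
        return stacked[step % len(stacked)]
    raise ValueError(f"unknown fusion strategy: {strategy}")

# Usage with two made-up reward signals for a batch of two transitions.
example = {"curiosity": [0.2, 0.5], "novelty": [0.4, 0.1]}
fused_sum = fuse_intrinsic_rewards(example, "sum")
fused_max = fuse_intrinsic_rewards(example, "max")
```

In practice the fused signal would typically be scaled and added to the environment reward; that weighting is left out here because the abstract gives no details on it.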

Authors:Songru Yang, Zili Liu, Zhenwei Shi, Zhengxia Zou
Title: WSSM: Geographic-enhanced hierarchical state-space model for global station weather forecast
Abstract:
Global Station Weather Forecasting (GSWF), a prominent meteorological research area, is pivotal in providing timely localized weather predictions. Despite the progress existing models have made in the overall accuracy of the GSWF, executing high-precision extreme event prediction still presents a substantial challenge. The recent emergence of state-space models, with their ability to efficiently capture continuous-time dynamics and latent states, offers potential solutions. However, early investigations indicated that Mamba underperforms in the context of GSWF, suggesting the need for further adaptation and optimization. To tackle this problem, in this paper, we introduce Weather State-space Model (WSSM), a novel Mamba-based approach tailored for GSWF. Geographical knowledge is integrated in addition to the widely-used positional encoding to represent the absolute spatial-temporal position. The multi-scale time-frequency features are synthesized from coarse to fine to model the seasonal to extreme weather dynamic. Our method effectively improves the overall prediction accuracy and addresses the challenge of forecasting extreme weather events. The state-of-the-art results obtained on the Weather-5K subset underscore the efficacy of the WSSM.
Chinese: 本文提出天气状态空间模型(WSSM),这是一种基于Mamba架构的新方法,通过融合地理知识和多尺度时频特征,有效提升了全球站点天气预报的整体精度和极端天气事件预测能力。
English: This paper introduces the Weather State-space Model (WSSM), a Mamba-based approach that integrates geographical knowledge and multi-scale time-frequency features to enhance both overall accuracy and extreme weather prediction in global station weather forecasting.

Authors:Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Zixuan Chen, Hubert Niewiadomski, Torsten Hoefler
Title: Reasoning Language Models: A Blueprint
Abstract:
Reasoning language models (RLMs), also known as Large Reasoning Models (LRMs), such as OpenAI's o1 and o3, DeepSeek-R1, and Alibaba's QwQ, have redefined AI's problem-solving capabilities by extending LLMs with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures - uniquely combining reinforcement learning (RL), search heuristics, and LLMs - present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (policy, value models and others), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint's versatility and unifying potential. To illustrate its utility, we introduce x1, a modular implementation for rapid RLM prototyping and experimentation. Using x1 and a literature review, we provide key insights, such as multi-phase training for policy and value models, and the importance of familiar training distributions. Finally, we discuss scalable RLM cloud deployments and we outline how RLMs can integrate with a broader LLM ecosystem. Our work demystifies RLM construction, democratizes advanced reasoning capabilities, and fosters innovation, aiming to mitigate the gap between "rich AI" and "poor AI" by lowering barriers to RLM design and experimentation.
English Summary: This paper introduces a modular blueprint for developing reasoning language models (RLMs) to address their high costs and complexity, featuring a versatile framework with mathematical formulations and an implementation tool (x1) to democratize advanced AI reasoning capabilities.

Authors:Siran Peng, Tianshuo Zhang, Li Gao, Xiangyu Zhu, Haoyuan Zhang, Kai Pang, Zhen Lei
Title: WMamba: Wavelet-based Mamba for Face Forgery Detection
Abstract:
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
中文: 提出的WMamba模型结合小波分析和基于Mamba的架构,通过动态轮廓卷积有效捕捉细微的人脸伪造痕迹,在伪造检测中实现了最先进的性能。
English: The proposed WMamba model utilizes wavelet analysis and a Mamba-based architecture with Dynamic Contour Convolution to effectively capture subtle facial forgery artifacts, achieving state-of-the-art performance in face forgery detection.

Authors:Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Danni Yang, Xiaoshuai Sun
Title: IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation
Abstract:
3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU metrics on the 3D-RES and 3D-GRES tasks, respectively.
中文: 本文提出的图像增强提示解码网络(IPDN)通过融合多视角图像语义和任务驱动提示,有效解决了三维指代表达分割中的特征模糊和意图模糊问题,显著提升了分割精度并达到最优性能。
English: This paper introduces an Image enhanced Prompt Decoding Network (IPDN) that addresses feature and intent ambiguity in 3D Referring Expression Segmentation by incorporating multi-view image semantics and task-driven prompts to enhance segmentation accuracy, achieving state-of-the-art performance improvements.

Authors:Zhaonan Wu, Yanjie Zhao, Chen Wei, Zirui Wan, Yue Liu, Haoyu Wang
Title: CommitShield: Tracking Vulnerability Introduction and Fix in Version Control Systems
Abstract:
Version control systems are commonly used to manage open-source software, in which each commit may introduce new vulnerabilities or fix existing ones. Researchers have developed various tools for detecting vulnerabilities in code commits, but their performance is limited by factors such as neglecting descriptive data and challenges in accurately identifying vulnerability introductions. To overcome these limitations, we propose CommitShield, which combines the code analysis capabilities of static analysis tools with the natural language and code understanding capabilities of large language models (LLMs) to enhance the accuracy of vulnerability introduction and fix detection by generating precise descriptions and obtaining rich patch contexts. We evaluate CommitShield using the newly constructed vulnerability repair dataset, CommitVulFix, and a cleaned vulnerability introduction dataset. Experimental results indicate that CommitShield improves recall by 76%-87% over state-of-the-art methods in the vulnerability fix detection task, and its F1-score improves by 15%-27% in the vulnerability introduction detection task.
Chinese: CommitShield 结合静态分析工具与大型语言模型,显著提升了漏洞引入和修复的检测能力,在漏洞修复检测中召回率提高76%-87%,漏洞引入检测的F1分数提升15%-27%。
English: CommitShield integrates static analysis tools with large language models to significantly improve the detection of both vulnerability introductions and fixes, achieving up to 87% higher recall and 27% better F1-scores than existing methods.

Authors:Nantheera Anantrasirichai, Fan Zhang, David Bull
Title: Artificial Intelligence in Creative Industries: Advances Prior to 2025
Abstract:
The rapid advancements in artificial intelligence (AI), particularly in generative AI and large language models (LLMs), have profoundly impacted the creative industries, enabling more innovative content creation, enhancing workflows, and democratizing access to creative tools. This paper explores these technological shifts, with particular focus on how those that have emerged since our previous review in 2022 have expanded creative opportunities and improved efficiency. These technological advancements have enhanced the capabilities of text-to-image, text-to-video, and multimodal generation technologies. In particular, key breakthroughs in LLMs have established new benchmarks in conversational AI, while advancements in image generators have revolutionized content creation. We also discuss the integration of AI into post-production workflows, which has significantly accelerated and improved traditional processes. Once content has been created, it must be delivered to its audiences; the media industry is now facing the demands of increased communication traffic due to creative content. We therefore include a discussion of how AI is beginning to transform the way we represent and compress media content. We highlight the trend toward unified AI frameworks capable of addressing and integrating multiple creative tasks, and we underscore the importance of human insight to drive the creative process and oversight to mitigate AI-generated inaccuracies. Finally, we explore AI's future potential in the creative sector, stressing the need to navigate emerging challenges and to maximize its benefits while addressing the associated risks.
中文: 本文探讨了人工智能,特别是生成式AI和大语言模型的最新进展如何通过提升内容创作、工作流效率和媒体传输变革创意产业,同时强调需要人类监督来管理风险并最大化其益处。
English: This paper examines how recent AI advancements, especially in generative AI and LLMs, are revolutionizing creative industries by enhancing content creation, workflow efficiency, and media delivery, while emphasizing the need for human oversight to manage risks and maximize benefits.

Authors:Jianwei Wang, Yuehai Wang, Kai Wang, Xuemin Lin, Wenjie Zhang, Ying Zhang
Title: Ensemble-based Deep Multilayer Community Search
Abstract:
Multilayer graphs, consisting of multiple interconnected layers, are widely used to model diverse relationships in the real world. A community is a cohesive subgraph that offers valuable insights for analyzing (multilayer) graphs. Recently, there has been an emerging trend focused on searching query-driven communities within the multilayer graphs. However, existing methods for multilayer community search are either 1) rule-based, which suffer from structure inflexibility; or 2) learning-based, which rely on labeled data or fail to capture layer-specific characteristics. To address these, we propose EnMCS, an Ensemble-based unsupervised (i.e., label-free) Multilayer Community Search framework. EnMCS contains two key components, i.e., HoloSearch which identifies potential communities in each layer while integrating both layer-shared and layer-specific information, and EMerge which is an Expectation-Maximization (EM)-based method that synthesizes the potential communities from each layer into a consensus community. Specifically, HoloSearch first employs a graph-diffusion-based model that integrates three label-free loss functions to learn layer-specific and layer-shared representations for each node. Communities in each layer are then identified based on nodes that exhibit high similarity in layer-shared representations while demonstrating low similarity in layer-specific representations w.r.t. the query nodes. To account for the varying layer-specific characteristics of each layer when merging communities, EMerge models the error rates of layers and true community as latent variables. It then employs the EM algorithm to simultaneously minimize the error rates of layers and predict the final consensus community through iterative maximum likelihood estimation. Experiments over 10 real-world datasets highlight the superiority of EnMCS in terms of both efficiency and effectiveness.
中文: EnMCS框架通过HoloSearch识别各层社区并整合共享与特定信息,再经EMerge基于期望最大化算法合并成共识社区,有效解决了现有多层社区搜索方法的不足。
English: The proposed EnMCS framework addresses limitations in multilayer community search by combining HoloSearch, which identifies communities using shared and layer-specific information, and EMerge, which merges them through an EM algorithm for a consensus result.
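
To make the idea of EM-based merging more concrete, here is a minimal sketch that treats the true community membership of each node and a per-layer error rate as latent quantities and alternates between estimating them. It is a generic symmetric-error model in the spirit of classical annotator-aggregation EM, not the authors' EMerge; the function name `em_merge`, the binary vote matrix, and the single error rate per layer are all assumptions.

```python
import numpy as np

def em_merge(votes, n_iter=50):
    """Merge per-layer community votes into a consensus (illustrative only).

    votes: array of shape (num_layers, num_nodes) with entries in {0, 1},
           where 1 means the layer places the node in the community.
    Returns (consensus, err): a boolean array of shape (num_nodes,) and the
    estimated error rate of each layer, shape (num_layers,).
    """
    votes = np.asarray(votes, dtype=float)
    post = votes.mean(axis=0)  # initial membership belief: fraction of layers voting 1
    tiny = 1e-6                # keeps logarithms finite
    for _ in range(n_iter):
        # M-step: expected disagreement of each layer with the current belief.
        err = (post * (1 - votes) + (1 - post) * votes).mean(axis=1)
        err = np.clip(err, tiny, 1 - tiny)
        prior = np.clip(post.mean(), tiny, 1 - tiny)
        # E-step: posterior that each node is a true member, given all votes.
        log_hit = np.log(1 - err)[:, None]
        log_miss = np.log(err)[:, None]
        log_p1 = np.log(prior) + (votes * log_hit + (1 - votes) * log_miss).sum(axis=0)
        log_p0 = np.log(1 - prior) + (votes * log_miss + (1 - votes) * log_hit).sum(axis=0)
        post = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
    return post > 0.5, err

# Usage: two layers agree, a third disagrees on two of the six nodes.
layer_votes = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
]
consensus, layer_err = em_merge(layer_votes)
```

The point of the sketch is that layers whose votes disagree with the emerging consensus receive a higher estimated error rate and therefore less weight in the next E-step, which is one plausible way to account for layers of varying reliability.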

Authors:Jianwei Wang, Kai Wang, Ying Zhang, Wenjie Zhang, Xiwei Xu, Xuemin Lin
Title: On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing
Abstract:
Missing data imputation, which aims to impute the missing values in the raw datasets to achieve the completeness of datasets, is crucial for modern data-driven models like large language models (LLMs) and has attracted increasing interest over the past decades. Despite its importance, existing solutions for missing data imputation either 1) only support numerical and categorical data or 2) show an unsatisfactory performance due to their design prioritizing text data and the lack of key properties for tabular data imputation. In this paper, we propose UnIMP, a Unified IMPutation framework that leverages LLM and high-order message passing to enhance the imputation of mixed-type data including numerical, categorical, and text data. Specifically, we first introduce a cell-oriented hypergraph to model the table. We then propose BiHMP, an efficient Bidirectional High-order Message-Passing network to aggregate global-local information and high-order relationships on the constructed hypergraph while capturing the inter-column heterogeneity and intra-column homogeneity. To effectively and efficiently align the capacity of the LLM with the information aggregated by BiHMP, we introduce Xfusion, which, together with BiHMP, acts as adapters for the LLM. We follow a pre-training and fine-tuning pipeline to train UnIMP, integrating two optimizations: chunking technique, which divides tables into smaller chunks to enhance efficiency; and progressive masking technique, which gradually adapts the model to learn more complex data patterns. Both theoretical proofs and empirical experiments on 10 real world datasets highlight the superiority of UnIMP over existing techniques.
中文: 本文提出UnIMP统一插补框架,通过结合大语言模型与双向高阶消息传递技术,有效处理混合类型数据的缺失值填补问题,理论与实验均证明其优于现有方法。
English: This paper introduces UnIMP, a unified imputation framework that leverages large language models and bidirectional high-order message passing to effectively handle mixed-type data imputation, demonstrating superior performance over existing methods through both theoretical and empirical validation.

Authors:Chenyang Liu, Keyan Chen, Rui Zhao, Zhengxia Zou, Zhenwei Shi
Title: Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model
Abstract:
Generative foundation models have advanced large-scale text-driven natural image generation, becoming a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image-text datasets are small in scale and confined to specific geographic areas and scene types. Besides, existing text2image methods have struggled to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10.5 million image-text pairs, 5 times larger than the previous largest one. The dataset covers a wide range of geographic scenes and contains resolution information, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion parameter generative foundation model based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify image resolutions. A dynamic condition adaptation strategy is proposed for training and inference to improve image quality. Text2Earth excels in zero-shot text2image generation and demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This robust capability surpasses previous models restricted to the basic fixed size and limited scene types. On the previous benchmark dataset, Text2Earth outperforms previous models with an improvement of +26.23 FID and +20.95% Zero-shot Cls-OA metric. Our project page is https://chen-yang-liu.github.io/Text2Earth
中文摘要:本文提出Git-10M数据集和Text2Earth基础模型,解决了遥感领域文本生成图像技术中数据规模小、场景单一的问题,实现了全球尺度、分辨率可控的高质量图像生成。
English Summary: This paper introduces the Git-10M dataset and Text2Earth foundation model to address limitations in remote sensing text-to-image generation, achieving superior performance in global-scale, resolution-controllable image synthesis.

Authors:Juan Wen, Weiyan Hou, Luc Van Gool, Radu Timofte
Title: MatIR: A Hybrid Mamba-Transformer Image Restoration Model
Abstract:
In recent years, Transformer-based models have made significant progress in the field of image restoration by leveraging their inherent ability to capture complex contextual features. Recently, Mamba models have made a splash in the field of computer vision due to their ability to handle long-range dependencies and their significant computational efficiency compared to Transformers. However, Mamba currently lags behind Transformers in contextual learning capabilities. To overcome the limitations of these two models, we propose a Mamba-Transformer hybrid image restoration model called MatIR. Specifically, MatIR cross-cycles the blocks of the Transformer layer and the Mamba layer to extract features, thereby fully exploiting the advantages of the two architectures. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths to achieve efficient processing of long sequence data. In the Transformer module, we combine triangular window-based local attention with channel-based global attention to effectively activate the attention mechanism over a wider range of image pixels. Extensive experimental results and ablation studies demonstrate the effectiveness of our approach.
中文:提出的MatIR模型通过交叉循环Transformer和Mamba模块,结合了Transformer的上下文学习优势与Mamba的长程依赖处理能力,并引入专门模块显著提升了图像恢复性能。
English: The proposed MatIR model combines Transformer and Mamba architectures through cross-cycling blocks to leverage their respective strengths in contextual learning and long-range dependency handling, introducing specialized modules for enhanced image restoration performance.

Authors:Kenta Uesugi, Naoki Saito, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Triplet Synthesis For Enhancing Composed Image Retrieval via Counterfactual Image Generation
Abstract:
Composed Image Retrieval (CIR) provides an effective way to manage and access large-scale visual data. Construction of the CIR model utilizes triplets that consist of a reference image, modification text describing desired changes, and a target image that reflects these changes. Effectively training CIR models requires extensive manual annotation to construct high-quality training datasets, which can be time-consuming and labor-intensive. To deal with this problem, this paper proposes a novel triplet synthesis method by leveraging counterfactual image generation. By controlling visual feature modifications via counterfactual image generation, our approach automatically generates diverse training triplets without any manual intervention. This approach facilitates the creation of larger and more expressive datasets, leading to improved CIR model performance.
中文: 本文提出了一种新颖的反事实图像生成方法,能够自动合成多样化的组合图像检索训练三元组,无需人工标注即可提升模型性能。
English: This paper introduces a novel counterfactual image generation method to automatically synthesize diverse training triplets for Composed Image Retrieval, eliminating the need for manual annotation and enhancing model performance.

Authors:Qingyue Long, Can Rong, Huandong Wang, Yong Li
Title: One Fits All: General Mobility Trajectory Modeling via Masked Conditional Diffusion
Abstract:
Trajectory data play a crucial role in many applications, ranging from network optimization to urban planning. Existing studies on trajectory data are task-specific, and their applicability is limited to the specific tasks on which they have been trained, such as generation, recovery, or prediction. However, the potential of a unified model has not yet been fully explored in trajectory modeling. Although various trajectory tasks differ in inputs, outputs, objectives, and conditions, they share common mobility patterns. Based on these common patterns, we can construct a general framework that enables a single model to address different tasks. However, building a trajectory task-general framework faces two critical challenges: 1) the diversity in the formats of different tasks and 2) the complexity of the conditions imposed on different tasks. In this work, we propose a general trajectory modeling framework via masked conditional diffusion (named GenMove). Specifically, we utilize mask conditions to unify diverse formats. To adapt to complex conditions associated with different tasks, we utilize historical trajectory data to obtain contextual trajectory embeddings, which include rich contexts such as spatiotemporal characteristics and user preferences. Integrating the contextual trajectory embedding into diffusion models through a classifier-free guidance approach allows the model to flexibly adjust its outputs based on different conditions. Extensive experiments on mainstream tasks demonstrate that our model significantly outperforms state-of-the-art baselines, with the highest performance improvement exceeding 13% in generation tasks.
中文摘要:现有轨迹数据研究多为任务专用模型,而GenMove框架通过掩码条件扩散和上下文轨迹嵌入构建通用模型,在多项任务中显著超越现有最优方法,最高性能提升超过13%。
English Summary: Trajectory data applications are often limited by task-specific models, but the proposed GenMove framework uses masked conditional diffusion and contextual embeddings to create a unified model that significantly outperforms existing methods across various tasks.
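The classifier-free guidance step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give GenMove's exact formulation, so the code below uses one common convention for the technique (unconditional prediction plus a scaled difference), and the names `guided_noise`, `maybe_drop_condition`, `guidance_scale`, and `drop_prob` are hypothetical.

```python
import random

def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Combine conditional and unconditional noise predictions.

    Convention assumed here: eps = eps_uncond + s * (eps_cond - eps_uncond).
    s = 0 ignores the condition, s = 1 recovers the conditional prediction,
    and s > 1 pushes the output further toward the condition.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

def maybe_drop_condition(cond, drop_prob, rng=random):
    """During training, randomly replace the condition with None so that one
    network learns both the conditional and the unconditional prediction."""
    return None if rng.random() < drop_prob else cond
```

In this sketch the condition would be the contextual trajectory embedding; how it is actually injected into the denoiser is not specified in the abstract.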

Authors:Shakeeb Murtaza, Soufiane Belharbi, Marco Pedersoli, Eric Granger
Title: TeD-Loc: Text Distillation for Weakly Supervised Object Localization
Abstract:
Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.
中文: 本文提出TeD-Loc方法,通过将CLIP文本嵌入知识蒸馏到视觉主干中,实现单模型的精准局部定位与分类,在提升定位精度5%的同时显著降低了计算复杂度。
English: This paper introduces TeD-Loc, a weakly supervised object localization method that distills CLIP text embeddings into the visual backbone to achieve accurate patch-level localization and classification within a single model, improving Top-1 LOC accuracy by about 5% on CUB and ILSVRC while reducing computational complexity compared to GenPromp.

Authors:Minghao Fu, Biwei Huang, Zijian Li, Yujia Zheng, Ignavier Ng, Guangyi Chen, Yingyao Hu, Kun Zhang
Title: Learning General Causal Structures with Hidden Dynamic Process for Climate Analysis
Abstract:
Understanding climate dynamics requires going beyond correlations in observational data to uncover their underlying causal process. Latent drivers, such as atmospheric processes, play a critical role in temporal dynamics, while direct causal influences also exist among geographically proximate observed variables. Traditional Causal Representation Learning (CRL) typically focuses on latent factors but overlooks such observable-to-observable causal relations, limiting its applicability to climate analysis. In this paper, we introduce a unified framework that jointly uncovers (i) causal relations among observed variables and (ii) latent driving forces together with their interactions. We establish conditions under which both the hidden dynamic processes and the causal structure among observed variables are simultaneously identifiable from time-series data. Remarkably, our guarantees hold even in the nonparametric setting, leveraging contextual information to recover latent variables and causal relations. Building on these insights, we propose CaDRe (Causal Discovery and Representation learning), a time-series generative model with structural constraints that integrates CRL and causal discovery. Experiments on synthetic datasets validate our theoretical results. On real-world climate datasets, CaDRe not only delivers competitive forecasting accuracy but also recovers visualized causal graphs aligned with domain expertise, thereby offering interpretable insights into climate systems.
中文: 本文提出一个统一框架,能够同时揭示观测变量间的因果关系与潜在驱动因素,建立了可识别性条件并开发了CaDRe模型,在气候分析中既实现了优越的预测性能,又生成了符合领域知识的可解释因果图。
English: This paper introduces a unified framework that simultaneously uncovers causal relations among observed variables and latent driving forces, establishing identifiability conditions and proposing the CaDRe model, which demonstrates competitive forecasting and interpretable causal graphs in climate analysis.

Authors:Shixuan Song, Hao Chen, Shu Hu, Xin Wang, Jinrong Hu, Xi Wu
Title: Teacher Encoder-Student Decoder Denoising Guided Segmentation Network for Anomaly Detection
Abstract:
Visual anomaly detection is a highly challenging task, often categorized as a one-class classification and segmentation problem. Recent studies have demonstrated that the student-teacher (S-T) framework effectively addresses this challenge. However, most S-T frameworks rely solely on pre-trained teacher networks to guide student networks in learning multi-scale similar features, overlooking the potential of the student networks to enhance learning through multi-scale feature fusion. In this study, we propose a novel model named PFADSeg, which integrates a pre-trained teacher network, a denoising student network with multi-scale feature fusion, and a guided anomaly segmentation network into a unified framework. By adopting a unique teacher-encoder and student-decoder denoising mode, the model improves the student network's ability to learn from teacher network features. Furthermore, an adaptive feature fusion mechanism is introduced to train a self-supervised segmentation network that synthesizes anomaly masks autonomously, significantly increasing detection performance. Evaluated on the MVTec AD dataset, PFADSeg achieves state-of-the-art results with an image-level AUC of 98.9%, a pixel-level mean precision of 76.4%, and an instance-level mean precision of 78.7%.
Chinese: PFADSeg模型通过融合预训练教师网络、具有多尺度特征融合的去噪学生网络和引导式异常分割网络,显著提升了视觉异常检测性能,在MVTec AD数据集上取得了最优结果。
English: The PFADSeg model enhances visual anomaly detection by integrating a pre-trained teacher network with a student network that utilizes multi-scale feature fusion and a guided segmentation network, achieving state-of-the-art performance on the MVTec AD dataset.

Authors:Zixun Fang, Zhiheng Liu, Kai Zhu, Yu Liu, Ka Leong Cheng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Title: VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization
Abstract:
Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color fidelity. Project page: https://becauseimbatman0.github.io/VanGogh.
中文: VanGogh是一种基于多模态扩散的视频着色框架,通过融合多种模态特征并采用深度引导生成和光流损失等技术,有效防止颜色溢出和闪烁,确保时间一致性,使用户能够全面控制生成过程,获得高质量着色视频。
English: VanGogh is a multimodal diffusion-based framework that enhances video colorization by integrating multiple modalities and employing techniques like depth guidance and optical flow loss to prevent color bleeding and ensure temporal consistency, offering users comprehensive control for high-quality results.

Authors:Lingjing Kong, Guangyi Chen, Petar Stojanov, Haoxuan Li, Eric P. Xing, Kun Zhang
Title: Towards Understanding Extrapolation: a Causal Lens
Abstract:
Canonical work handling distribution shifts typically necessitates an entire target distribution that lands inside the training distribution. However, practical scenarios often involve only a handful of target samples, potentially lying outside the training support, which requires the capability of extrapolation. In this work, we aim to provide a theoretical understanding of when extrapolation is possible and offer principled methods to achieve it without requiring an on-support target distribution. To this end, we formulate the extrapolation problem with a latent-variable model that embodies the minimal change principle in causal mechanisms. Under this formulation, we cast the extrapolation problem into a latent-variable identification problem. We provide realistic conditions on shift properties and the estimation objectives that lead to identification even when only one off-support target sample is available, tackling the most challenging scenarios. Our theory reveals the intricate interplay between the underlying manifold's smoothness and the shift properties. We showcase how our theoretical results inform the design of practical adaptation algorithms. Through experiments on both synthetic and real-world data, we validate our theoretical findings and their practical implications.
中文摘要:本研究通过将外推问题构建为潜变量识别问题,提供了在仅有少量目标样本(即使超出训练分布)时实现外推的理论框架,并通过实验验证了实际适应算法的有效性。
English Summary: This study provides a theoretical framework for achieving extrapolation with minimal target samples, even outside the training distribution, by formulating it as a latent-variable identification problem and validating practical adaptation algorithms through experiments.

Authors:Wenyan Ma, Lipeng Zhu, Rui Zhang
Title: Movable Antenna Enhanced Integrated Sensing and Communication Via Antenna Position Optimization
Abstract:
In this paper, we propose an integrated sensing and communication (ISAC) system aided by the movable-antenna (MA) array, which can improve the communication and sensing performance via flexible antenna movement over conventional fixed-position antenna (FPA) arrays. First, we consider the downlink multiuser communication, where each user is randomly distributed within a given three-dimensional zone with local movement. To reduce the overhead of frequent antenna movement, the antenna position vector (APV) is designed based on users' statistical channel state information (CSI), so that the antennas only need to be moved on a large timescale. Then, for target sensing, the Cramér-Rao bounds (CRBs) of the estimation mean square error for different spatial angles of arrival (AoAs) are derived as functions of MAs' positions. Based on the above, we formulate an optimization problem to maximize the expected minimum achievable rate among all communication users, with given constraints on the maximum acceptable CRB thresholds for target sensing. An alternating optimization algorithm is proposed to iteratively optimize one of the horizontal and vertical APVs of the MA array with the other being fixed. Numerical results demonstrate that our proposed MA arrays can significantly enlarge the trade-off region between communication and sensing performance compared to conventional FPA arrays with different inter-antenna spacing. It is also revealed that the steering vectors of the designed MA arrays exhibit low correlation in the angular domain, thus effectively reducing channel correlation among communication users to enhance their achievable rates, while alleviating ambiguity in target angle estimation to achieve improved sensing accuracy.
中文摘要:本文提出一种基于可移动天线阵列的集成传感与通信系统,通过灵活调整天线位置,相比传统固定天线阵列能显著提升多用户通信速率和目标角度估计精度,实现通信与感知性能的协同优化。
English Summary: This paper introduces a movable-antenna array system that enhances integrated sensing and communication performance through flexible antenna positioning, demonstrating significant improvements in both user communication rates and target sensing accuracy compared to conventional fixed-antenna arrays.

Authors:Alexander Korotin, Vladimir V'yugin, Evgeny Burnaev
Title: Online Algorithm for Aggregating Experts' Predictions with Unbounded Quadratic Loss
Abstract:
We consider the problem of online aggregation of expert predictions with the quadratic loss function. We propose an algorithm for aggregating expert predictions which does not require prior knowledge of the upper bound on the losses. The algorithm is based on the exponential reweighting of expert losses.
Chinese: 本文提出了一种在线聚合专家预测的算法,通过指数重加权专家损失,无需预先知道损失的上界。
English: This paper introduces an algorithm for online aggregation of expert predictions using quadratic loss, which eliminates the need for prior knowledge of loss bounds by employing exponential reweighting of expert losses.
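Exponential reweighting of expert losses can be sketched as follows. This is a generic exponentially weighted average forecaster under quadratic loss, not the authors' algorithm: the abstract does not give the update rule, and the fixed learning rate `eta` is an assumption made for brevity (the paper's contribution is precisely to avoid needing a loss bound, which a fixed rate does not address). The function name `aggregate_online` is hypothetical.

```python
import math

def aggregate_online(expert_preds, outcomes, eta=0.5):
    """Exponentially weighted average forecaster with quadratic loss.

    expert_preds: list of rounds, each a list of N expert predictions.
    outcomes: realized values, one per round.
    eta: learning rate (assumed fixed here, for illustration only).
    Returns (aggregated forecasts, final normalized weights).
    """
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts

    def current_weights():
        # Shift by the smallest cumulative loss for numerical stability.
        base = min(cum_loss)
        raw = [math.exp(-eta * (loss - base)) for loss in cum_loss]
        total = sum(raw)
        return [w / total for w in raw]

    forecasts = []
    for preds, y in zip(expert_preds, outcomes):
        weights = current_weights()
        # Forecast is the weighted average of the experts' predictions.
        forecasts.append(sum(w * p for w, p in zip(weights, preds)))
        # Each expert then suffers its quadratic loss on the revealed outcome.
        for i, p in enumerate(preds):
            cum_loss[i] += (p - y) ** 2
    return forecasts, current_weights()
```

With two experts where the first is always correct, the forecasts drift toward the first expert's predictions as its relative weight grows.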

Authors:Hang Yang, Hao Chen, Hui Guo, Yineng Chen, Ching-Sheng Lin, Shu Hu, Jinrong Hu, Xi Wu, Xin Wang
Title: LLM-MedQA: Enhancing Medical Question Answering through Case Studies in Large Language Models
Abstract:
Accurate and efficient question-answering systems are essential for delivering high-quality patient care in the medical field. While Large Language Models (LLMs) have made remarkable strides across various domains, they continue to face significant challenges in medical question answering, particularly in understanding domain-specific terminologies and performing complex reasoning. These limitations undermine their effectiveness in critical medical applications. To address these issues, we propose a novel approach incorporating similar case generation within a multi-agent medical question-answering (MedQA) system. Specifically, we leverage the Llama3.1:70B model, a state-of-the-art LLM, in a multi-agent architecture to enhance performance on the MedQA dataset using zero-shot learning. Our method capitalizes on the model's inherent medical knowledge and reasoning capabilities, eliminating the need for additional training data. Experimental results show substantial performance gains over existing benchmark models, with improvements of 7% in both accuracy and F1-score across various medical QA tasks. Furthermore, we examine the model's interpretability and reliability in addressing complex medical queries. This research not only offers a robust solution for medical question answering but also establishes a foundation for broader applications of LLMs in the medical domain.
中文: 本研究提出一种结合相似病例生成的多智能体医疗问答系统,采用Llama3.1:70B模型实现零样本学习,在准确率和F1值上均提升7%,无需额外训练数据即可显著提升医疗问答性能。
English: This study introduces a multi-agent medical question-answering system using the Llama3.1:70B model with similar case generation, achieving significant performance improvements of 7% in accuracy and F1-score without requiring additional training data.

Authors:Ren Tasai, Guang Li, Ren Togo, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Kenji Hirata, Takahiro Ogawa, Kohsuke Kudo, Miki Haseyama
Title: Continual Self-supervised Learning Considering Medical Domain Knowledge in Chest CT Images
Abstract:
We propose a novel continual self-supervised learning method (CSSL) considering medical domain knowledge in chest CT images. Our approach addresses the challenge of sequential learning by effectively capturing the relationship between previously learned knowledge and new information at different stages. By incorporating an enhanced DER into CSSL and maintaining both diversity and representativeness within the rehearsal buffer of DER, the risk of data interference during pretraining is reduced, enabling the model to learn richer and more robust feature representations. In addition, we incorporate a mixup strategy and feature distillation to further enhance the model's ability to learn meaningful representations. We validate our method using chest CT images obtained under two different imaging conditions, demonstrating superior performance compared to state-of-the-art methods.
中文: 我们提出一种针对胸部CT图像的新型持续自监督学习方法,融合医学领域知识,通过改进的DER框架减少数据干扰,并采用混合和特征蒸馏策略增强表征能力,在不同成像条件下均优于现有方法。
English: We introduce a novel continual self-supervised learning method for chest CT images that integrates medical domain knowledge, reduces data interference through an enhanced DER framework, and employs mixup and feature distillation to achieve robust feature representation, outperforming existing techniques under varied imaging conditions.
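As a rough illustration of the mixup strategy named in the abstract, the sketch below blends two inputs with a coefficient drawn from a Beta distribution, which is the standard form of mixup. How the paper applies it (for example, between current-stage images and rehearsal-buffer images) is not stated in the abstract, so the function name `mixup`, the flat-list image representation, and the default `alpha` are assumptions.

```python
import random

def mixup(x1, x2, alpha=1.0, rng=random):
    """Blend two equally sized inputs (flattened to lists of floats).

    lam ~ Beta(alpha, alpha); mixed = lam * x1 + (1 - lam) * x2.
    Returns the mixed input and the sampled coefficient.
    """
    lam = rng.betavariate(alpha, alpha)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    return mixed, lam
```

Passing a seeded `random.Random` instance as `rng` makes the sampled coefficient reproducible.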

Authors:Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Generative Dataset Distillation Based on Self-knowledge Distillation
Abstract:
Dataset distillation is an effective technique for reducing the cost and complexity of model training while maintaining performance by compressing large datasets into smaller, more efficient versions. In this paper, we present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits. Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data, thereby capturing the overall structure and relationships within the data. To further improve the accuracy of alignment, we introduce a standardization step on the logits before performing distribution matching, ensuring consistency in the range of logits. Through extensive experiments, we demonstrate that our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
中文: 本文提出了一种新颖的生成式数据集蒸馏方法,通过自知识蒸馏和对数标准化提升预测对数对齐精度,在实验中展现出优于现有先进技术的蒸馏性能。
English: This paper introduces a novel generative dataset distillation method that enhances prediction logit alignment accuracy through self-knowledge distillation and logit standardization, achieving superior performance over existing state-of-the-art techniques.
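The idea of standardizing logits before distribution matching can be sketched as below. The abstract says only that a standardization step ensures a consistent logit range; the per-sample z-score, the KL divergence as the matching criterion, and the names `standardize` and `matching_loss` are assumptions for illustration.

```python
import math

def standardize(logits, eps=1e-8):
    """Z-score one logit vector so it has zero mean and unit variance."""
    mean = sum(logits) / len(logits)
    var = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(var) + eps
    return [(z - mean) / std for z in logits]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def matching_loss(logits_real, logits_synth):
    """Compare the two logit vectors after standardizing both."""
    p = softmax(standardize(logits_real))
    q = softmax(standardize(logits_synth))
    return kl_divergence(p, q)
```

Two logit vectors that differ only in scale, such as `[1, 2, 3]` and `[10, 20, 30]`, give very different softmax outputs when used raw but nearly identical ones after standardization, which is the consistency the step is meant to provide.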

Authors:Ji Cao, Tongya Zheng, Qinghong Guo, Yu Wang, Junshu Dai, Shunyu Liu, Jie Yang, Jie Song, Mingli Song
Title: Holistic Semantic Representation for Navigational Trajectory Generation
Abstract:
Trajectory generation has garnered significant attention from researchers in the field of spatio-temporal analysis, as it can generate substantial synthesized human mobility trajectories that enhance user privacy and alleviate data scarcity. However, existing trajectory generation methods often focus on improving trajectory generation quality from a singular perspective, lacking a comprehensive semantic understanding across various scales. Consequently, we are inspired to develop a HOlistic SEmantic Representation (HOSER) framework for navigational trajectory generation. Given an origin-and-destination (OD) pair and the starting time point of a latent trajectory, we first propose a Road Network Encoder to expand the receptive field of road- and zone-level semantics. Second, we design a Multi-Granularity Trajectory Encoder to integrate the spatio-temporal semantics of the generated trajectory at both the point and trajectory levels. Finally, we employ a Destination-Oriented Navigator to seamlessly integrate destination-oriented guidance. Extensive experiments on three real-world datasets demonstrate that HOSER outperforms state-of-the-art baselines by a significant margin. Moreover, the model's performance in few-shot learning and zero-shot learning scenarios further verifies the effectiveness of our holistic semantic representation.
中文摘要:HOSER框架通过融合多尺度语义理解改进了轨迹生成,其整体时空数据表征方法显著超越了现有技术。
English Summary: The HOSER framework enhances trajectory generation by integrating multi-scale semantic understanding, significantly outperforming existing methods through holistic representation of spatio-temporal data.

Authors:Chao Liang, Linchao Zhu, Zongxin Yang, Wei Chen, Yi Yang
Title: Noise-Tolerant Hybrid Prototypical Learning with Noisy Web Data
Abstract:
We focus on the challenging problem of learning an unbiased classifier from a large number of potentially relevant but noisily labeled web images given only a few clean labeled images. This problem is particularly practical because it reduces the expensive annotation costs by utilizing freely accessible web images with noisy labels. Typically, prototypes are representative images or features used to classify or identify other images. However, in the few clean and many noisy scenarios, the class prototype can be severely biased due to the presence of irrelevant noisy images. The resulting prototypes are less compact and discriminative, as previous methods do not take into account the diverse range of images in the noisy web image collections. On the other hand, the relation modeling between noisy and clean images is not learned for the class prototype generation in an end-to-end manner, which results in a suboptimal class prototype. In this article, we introduce a similarity maximization loss named SimNoiPro. Our SimNoiPro first generates noise-tolerant hybrid prototypes composed of clean and noise-tolerant prototypes and then pulls them closer to each other. Our approach considers the diversity of noisy images by explicit division and overcomes the optimization discrepancy issue. This enables better relation modeling between clean and noisy images and helps extract judicious information from the noisy image set. The evaluation results on two extended few-shot classification benchmarks confirm that our SimNoiPro outperforms prior methods in measuring image relations and cleaning noisy data.
Chinese: 本研究提出SimNoiPro方法,通过生成噪声容忍的混合原型,在少量干净标注和大量噪声网络图像中学习无偏分类器,有效改善了关系建模并利用了噪声数据的多样性。
English: This study introduces SimNoiPro, a method that creates noise-tolerant hybrid prototypes to effectively learn unbiased classifiers from a few clean and many noisy web images by improving relation modeling and leveraging diverse noisy data.

Authors:Xuan Shen, Yizhou Wang, Xiangxi Shi, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Title: Efficient Reasoning with Hidden Thinking
Abstract:
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces significant inefficiencies. In this work, we propose $\textbf{Heima}$ (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs in a hidden latent space. We design the Heima Encoder to condense each intermediate CoT into a compact, higher-level hidden representation using a single thinking token, effectively minimizing verbosity and reducing the overall number of tokens required during the reasoning process. Meanwhile, we design a corresponding Heima Decoder with traditional Large Language Models (LLMs) to adaptively interpret the hidden representations into variable-length textual sequences, reconstructing reasoning processes that closely resemble the original CoTs. Experimental results across diverse reasoning MLLM benchmarks demonstrate that the Heima model achieves higher generation efficiency while maintaining or even improving zero-shot task accuracy. Moreover, the effective reconstruction of multimodal reasoning processes with the Heima Decoder validates both the robustness and interpretability of our approach.
中文:Heima框架通过将思维链过程压缩为隐藏表示以减少标记数量,从而提升多模态模型的推理效率,同时保持准确性并实现可解释的重构。
English: The Heima framework enhances reasoning efficiency in multimodal models by condensing Chain-of-Thought processes into hidden representations with minimal tokens, maintaining accuracy and enabling interpretable reconstruction.

Authors:Yuke Hu, Zheng Li, Zhihao Liu, Yang Zhang, Zhan Qin, Kui Ren, Chun Chen
Title: Membership Inference Attacks Against Vision-Language Models
Abstract:
Vision-Language Models (VLMs), built on pre-trained vision encoders and large language models (LLMs), have shown exceptional multi-modal understanding and dialog capabilities, positioning them as catalysts for the next technological revolution. However, while most VLM research focuses on enhancing multi-modal interaction, the risks of data misuse and leakage have been largely unexplored. This prompts the need for a comprehensive investigation of such risks in VLMs. In this paper, we conduct the first analysis of misuse and leakage detection in VLMs through the lens of membership inference attack (MIA). Specifically, we focus on the instruction tuning data of VLMs, which is more likely to contain sensitive or unauthorized information. To address the limitation of existing MIA methods, we introduce a novel approach that infers membership based on a set of samples and their sensitivity to temperature, a unique parameter in VLMs. Based on this, we propose four membership inference methods, each tailored to different levels of background knowledge, ultimately arriving at the most challenging scenario. Our comprehensive evaluations show that these methods can accurately determine membership status, e.g., achieving an AUC greater than 0.8 targeting a small set consisting of only 5 samples on LLaVA.
Chinese: 视觉语言模型面临数据滥用和泄露的重大风险,为此开发的新型成员推理攻击方法能有效检测未经授权的数据使用,例如在少量样本上实现超过0.8的AUC高准确率。
English: Vision-Language Models (VLMs) face significant risks of data misuse and leakage, prompting the development of novel membership inference attack methods that effectively detect unauthorized data usage with high accuracy, such as achieving over 0.8 AUC on small sample sets.

Authors:Yunbo Lyu, Zhou Yang, Yuqing Niu, Jing Jiang, David Lo
Title: Do Existing Testing Tools Really Uncover Gender Bias in Text-to-Image Models?
Abstract:
Text-to-Image (T2I) models have recently gained significant attention due to their ability to generate high-quality images and are consequently used in a wide range of applications. However, there are concerns about the gender bias of these models. Previous studies have shown that T2I models can perpetuate or even amplify gender stereotypes when provided with neutral text prompts. Researchers have proposed automated gender bias uncovering detectors for T2I models, but a crucial gap exists: no existing work comprehensively compares the various detectors and understands how the gender bias detected by them deviates from the actual situation. This study addresses this gap by validating previous gender bias detectors using a manually labeled dataset and comparing how the bias identified by various detectors deviates from the actual bias in T2I models, as verified by manual confirmation. We create a dataset consisting of 6,000 images generated from three cutting-edge T2I models: Stable Diffusion XL, Stable Diffusion 3, and Dreamlike Photoreal 2.0. During the human-labeling process, we find that all three T2I models generate a portion (12.48% on average) of low-quality images (e.g., generate images with no face present), where human annotators cannot determine the gender of the person. Our analysis reveals that all three T2I models show a preference for generating male images, with SDXL being the most biased. Additionally, images generated using prompts containing professional descriptions (e.g., lawyer or doctor) show the most bias. We evaluate seven gender bias detectors and find that none fully capture the actual level of bias in T2I models, with some detectors overestimating bias by up to 26.95%. We further investigate the causes of inaccurate estimations, highlighting the limitations of detectors in dealing with low-quality images. Based on our findings, we propose an enhanced detector...
中文: 本研究通过比较七种检测器与人工标注数据,揭示了文本到图像模型普遍存在男性偏好,且现有检测器无法准确评估偏差,尤其在处理低质量图像时误差显著。
English: This study identifies a gap in gender bias detection for Text-to-Image models by comparing seven detectors against manually labeled data, revealing that all models exhibit male preference and current detectors inaccurately estimate bias, especially with low-quality images.

Authors:Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Zhiyuan Cai, Hongmei Wang, Xi Wang, Luyang Luo, Mingxiang Wu, Xian Wu, Ronald Cheong Kin Chan, Yuk Ming Lau, Yefeng Zheng, Pranav Rajpurkar, Hao Chen
Title: An Explainable Biomedical Foundation Model via Large-Scale Concept-Enhanced Vision-Language Pre-training
Abstract:
The clinical adoption of artificial intelligence (AI) in medical imaging requires models that are both diagnostically accurate and interpretable to clinicians. While current multimodal biomedical foundation models prioritize performance, their black-box nature hinders explaining the decision-making process in clinically meaningful concepts. Here, we present ConceptCLIP, the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy while delivering human-interpretable explanations across diverse imaging modalities. We curate MedConcept-23M, the largest pre-training dataset comprising 23 million image-text-concept triplets across diverse medical modalities, where clinical concepts are derived from the Unified Medical Language System. Leveraging this dataset, we develop ConceptCLIP through a novel dual-alignment approach that simultaneously learns global image-text representations and fine-grained region-concept associations for precise and interpretable medical image analysis. We curate the most extensive evaluation benchmark for multimodal biomedical foundation models, covering 52 clinical tasks spanning 10 imaging modalities. Extensive experiments demonstrate that ConceptCLIP outperforms existing state-of-the-art multimodal biomedical foundation models. Importantly, ConceptCLIP demonstrates superior diagnostic performance while providing human-understandable explanations validated by clinical experts. As the first precise and interpretable biomedical foundation model, ConceptCLIP represents a critical milestone toward the widespread clinical adoption of AI, thereby advancing trustworthy AI in medicine.
Chinese: ConceptCLIP是首个可解释的生物医学基础模型,它在多种医学影像模式中不仅实现了最先进的诊断准确性,还能提供人类可理解的解释,标志着可信赖医疗AI临床应用的重要里程碑。
English: ConceptCLIP is the first explainable biomedical foundation model that achieves state-of-the-art diagnostic accuracy while providing human-interpretable explanations across diverse medical imaging modalities, representing a significant advancement toward trustworthy clinical AI adoption.

Authors:Minrui Xu, Dusit Niyato, Christopher G. Brinton
Title: Serving Long-Context LLMs at the Mobile Edge: Test-Time Reinforcement Learning-based Model Caching and Inference Offloading
Abstract:
Large Language Models (LLMs) can perform zero-shot learning on unseen tasks and few-shot learning on complex reasoning tasks. However, resource-limited mobile edge networks struggle to support long-context LLM serving for LLM agents during multi-round interactions with users. Unlike stateless computation offloading and static service offloading in edge computing, optimizing LLM serving at edge servers is challenging because LLMs continuously learn from context which raises accuracy, latency, and resource consumption dynamics. In this paper, we propose a joint model caching and inference offloading framework that utilizes test-time deep reinforcement learning (T2DRL) to optimize deployment and execution strategies for long-context LLM serving. In this framework, we analyze the performance convergence and design an optimization problem considering the utilization of context windows in LLMs. Furthermore, the T2DRL algorithm can learn in both the training phase and the testing phase to proactively manage cached models and service requests and adapt to context changes and usage patterns during execution. To further enhance resource allocation efficiency, we propose a double Dutch auction (DDA) mechanism, which dynamically matches supply and demand while maximizing social welfare. Finally, experimental results demonstrate that the T2DRL algorithm can reduce system costs by at least 30% compared to baselines while guaranteeing the performance of LLM agents in real-world perception and reasoning tasks.
Chinese: 大语言模型(LLM)支持零样本和少样本学习,但在移动边缘网络中因动态上下文学习而面临挑战;本文提出了一种结合模型缓存和推理卸载的框架,采用测试时深度强化学习优化部署与执行,在保证性能的同时将系统成本降低至少30%。
English: Large Language Models (LLMs) enable zero-shot and few-shot learning but face challenges in mobile edge networks due to dynamic context learning; this paper introduces a joint model caching and inference offloading framework using test-time deep reinforcement learning to optimize deployment and execution, reducing system costs by at least 30% while maintaining performance.

Authors:Tianyuan Yao, Zhiyuan Li, Praitayini Kanakaraj, Derek B. Archer, Kurt Schilling, Lori Beason-Held, Susan Resnick, Bennett A. Landman, Yuankai Huo
Title: Polyhedra Encoding Transformers: Enhancing Diffusion MRI Analysis Beyond Voxel and Volumetric Embedding
Abstract:
Diffusion-weighted Magnetic Resonance Imaging (dMRI) is an essential tool in neuroimaging. It is arguably the sole noninvasive technique for examining the microstructural properties and structural connectivity of the brain. Recent years have seen the emergence of machine learning and data-driven approaches that enhance the speed, accuracy, and consistency of dMRI data analysis. However, traditional deep learning models often fall short, as they typically utilize pixel-level or volumetric patch-level embeddings similar to those used in structural MRI, and do not account for the unique distribution of various gradient encodings. In this paper, we propose a novel method called Polyhedra Encoding Transformer (PE-Transformer) for dMRI, designed specifically to handle spherical signals. Our approach involves projecting an icosahedral polygon onto a unit sphere to resample signals from predetermined directions. These resampled signals are then transformed into embeddings, which are processed by a transformer encoder that incorporates orientational information reflective of the icosahedral structure. Through experimental validation with various gradient encoding protocols, our method demonstrates superior accuracy in estimating multi-compartment models and Fiber Orientation Distributions (FOD), outperforming both conventional CNN architectures and standard transformers.
中文: 本文提出了一种名为多面体编码变换器(PE-Transformer)的新方法,通过将二十面体投影到单位球面对扩散加权MRI的球形信号进行重采样,并利用包含方向信息的变换器编码器处理,在估计脑微结构模型方面比传统方法表现出更高的准确性。
English: This paper introduces the Polyhedra Encoding Transformer (PE-Transformer), a novel method for diffusion-weighted MRI that processes spherical signals by projecting an icosahedral polygon onto a unit sphere and using a transformer encoder with orientational information, achieving superior accuracy in estimating brain microstructure models compared to conventional approaches.
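
The projection step described in the abstract can be illustrated with a minimal sketch: the 12 vertices of a regular icosahedron are scaled to unit length so that they serve as fixed sampling directions on the sphere. The function name `icosahedron_directions` is hypothetical, and the sketch only generates directions; how the paper interpolates dMRI signals onto them is not stated in the abstract and is not shown here.

```python
import math


def icosahedron_directions():
    """Return the 12 vertices of a regular icosahedron, scaled to the unit sphere.

    Assumption: the standard vertex set (0, +-1, +-phi) and its cyclic
    permutations is used; the paper's exact construction may differ.
    """
    phi = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio
    raw = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            raw.append((0.0, a, b))
            raw.append((a, b, 0.0))
            raw.append((b, 0.0, a))
    norm = math.sqrt(1.0 + phi * phi)  # every raw vertex has this length
    return [(x / norm, y / norm, z / norm) for x, y, z in raw]


if __name__ == "__main__":
    for d in icosahedron_directions():
        print(tuple(round(c, 4) for c in d))
```

Because the vertex set is symmetric, every direction has its antipode in the list, which matches the antipodal symmetry usually assumed for diffusion signals.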

Authors:Shengyuan Colin Lin, Felix Tian, Keyi Wang, Xingjian Zhao, Jimin Huang, Qianqian Xie, Luca Borella, Matt White, Christina Dan Wang, Kairong Xiao, Xiao-Yang Liu Yanglet, Li Deng
Title: Open FinLLM Leaderboard: Towards Financial AI Readiness
Abstract:
Financial large language models (FinLLMs) with multimodal capabilities are envisioned to revolutionize applications across business, finance, accounting, and auditing. However, real-world adoption requires robust benchmarks of FinLLMs' and FinAgents' performance. Maintaining an open leaderboard is crucial for encouraging innovative adoption and improving model effectiveness. In collaboration with Linux Foundation and Hugging Face, we create an open FinLLM leaderboard, which serves as an open platform for assessing and comparing AI models' performance on a wide spectrum of financial tasks. By democratizing access to advances in financial knowledge and intelligence, a chatbot or agent may enhance the analytical capabilities of the general public to a professional level within a few months of usage. This open leaderboard welcomes contributions from academia, the open-source community, industry, and stakeholders. In particular, we encourage contributions of new datasets, tasks, and models for continual update. Through fostering a collaborative and open ecosystem, we seek to promote financial AI readiness.
中文: 通过与Linux基金会和Hugging Face合作建立的开放FinLLM排行榜,致力于评估和提升多模态金融AI模型的性能,促进金融领域创新和广泛应用。
English: The creation of an open FinLLM leaderboard in collaboration with Linux Foundation and Hugging Face aims to benchmark and enhance multimodal financial AI models' performance, fostering innovation and broad adoption across financial sectors.

Authors:Weiliang Tang, Jia-Hui Pan, Yun-Hui Liu, Masayoshi Tomizuka, Li Erran Li, Chi-Wing Fu, Mingyu Ding
Title: GeoManip: Geometric Constraints as General Interfaces for Robot Manipulation
Abstract:
We present GeoManip, a framework to enable generalist robots to leverage essential conditions derived from object and part relationships, as geometric constraints, for robot manipulation. For example, cutting the carrot requires adhering to a geometric constraint: the blade of the knife should be perpendicular to the carrot's direction. By interpreting these constraints through symbolic language representations and translating them into low-level actions, GeoManip bridges the gap between natural language and robotic execution, enabling greater generalizability across diverse, even unseen, tasks, objects, and scenarios. Unlike vision-language-action models that require extensive training, GeoManip operates training-free by utilizing large foundational models: a constraint generation module that predicts stage-specific geometric constraints and a geometry parser that identifies object parts involved in these constraints. A solver then optimizes trajectories to satisfy inferred constraints from task descriptions and the scene. Furthermore, GeoManip learns in-context and provides five appealing human-robot interaction features: on-the-fly policy adaptation, learning from human demonstrations, learning from failure cases, long-horizon action planning, and efficient data collection for imitation learning. Extensive evaluations on both simulations and real-world scenarios demonstrate GeoManip's state-of-the-art performance, with superior out-of-distribution generalization while avoiding costly model training.
中文: GeoManip框架通过从物体关系中提取几何约束并转化为动作,使通用机器人无需训练即可执行操作任务,它利用大型基础模型生成约束并优化轨迹,实现卓越的泛化能力。
English: GeoManip is a framework that enables generalist robots to perform manipulation tasks by deriving geometric constraints from object relationships and translating them into actions without requiring training, using large foundational models for constraint generation and trajectory optimization.
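
The carrot-cutting example in the abstract amounts to a perpendicularity constraint between two direction vectors. A minimal sketch of how such a constraint could be checked is given below; the function names and the 5-degree tolerance are illustrative assumptions, not part of GeoManip's actual interface, which the abstract does not describe.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def is_perpendicular(u, v, tol_deg=5.0):
    """True if u and v are within tol_deg of a right angle (tolerance is an assumption)."""
    return abs(angle_between_deg(u, v) - 90.0) <= tol_deg


if __name__ == "__main__":
    blade_direction = (0.0, 0.0, 1.0)   # hypothetical knife blade direction
    carrot_axis = (1.0, 0.0, 0.0)       # hypothetical carrot long axis
    print(is_perpendicular(blade_direction, carrot_axis))
```

A trajectory solver could use the deviation from 90 degrees as a cost term rather than a hard pass/fail test; the boolean form is used here only to keep the example short.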

Authors:Ping Guo, Cheng Gong, Xi Lin, Fei Liu, Zhichao Lu, Qingfu Zhang, Zhenkun Wang
Title: MOS-Attack: A Scalable Multi-objective Adversarial Attack Framework
Abstract:
Crafting adversarial examples is crucial for evaluating and enhancing the robustness of Deep Neural Networks (DNNs), presenting a challenge equivalent to maximizing a non-differentiable 0-1 loss function. However, existing single-objective methods, namely adversarial attacks that focus on a surrogate loss function, do not fully harness the benefits of engaging multiple loss functions, as a result of insufficient understanding of their synergistic and conflicting nature. To overcome these limitations, we propose the Multi-Objective Set-based Attack (MOS Attack), a novel adversarial attack framework leveraging multiple loss functions and automatically uncovering their interrelations. The MOS Attack adopts a set-based multi-objective optimization strategy, enabling the incorporation of numerous loss functions without additional parameters. It also automatically mines synergistic patterns among various losses, facilitating the generation of potent adversarial attacks with fewer objectives. Extensive experiments have shown that our MOS Attack outperforms single-objective attacks. Furthermore, by harnessing the identified synergistic patterns, MOS Attack continues to show superior results with a reduced number of loss functions.
中文摘要:本文提出的多目标集合攻击(MOS Attack)框架通过基于集合的优化策略利用多个损失函数,自动挖掘其协同关系,从而以更少的目标函数生成更强的对抗样本,克服了单目标攻击的局限性。
English Summary: The proposed Multi-Objective Set-based Attack (MOS Attack) framework overcomes limitations of single-objective adversarial attacks by leveraging multiple loss functions through set-based optimization, automatically discovering their synergistic relationships to generate stronger attacks with fewer objectives.

Authors:Yuankun Xie, Xiaopeng Wang, Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Songjun Cao, Long Ma, Chenxing Li, Haonnan Cheng, Long Ye
Title: Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition
Abstract:
Current research in audio deepfake detection is gradually transitioning from binary classification to multi-class tasks, referred to as the audio deepfake source tracing task. However, existing studies on source tracing consider only closed-set scenarios and have not considered the challenges posed by open-set conditions. In this paper, we define the Neural Codec Source Tracing (NCST) task, which is capable of performing open-set neural codec classification and interpretable ALM detection. Specifically, we constructed the ST-Codecfake dataset for the NCST task, which includes bilingual audio samples generated by 11 state-of-the-art neural codec methods and ALM-based out-of-distribution (OOD) test samples. Furthermore, we establish a comprehensive source tracing benchmark to assess NCST models in open-set conditions. The experimental results reveal that although the NCST models perform well in in-distribution (ID) classification and OOD detection, they lack robustness in classifying unseen real audio. The ST-Codecfake dataset and code are available.
中文:当前音频深度伪造检测正从二元分类转向多类别溯源任务,但现有研究仅考虑封闭集场景,本研究为此定义了神经编解码器溯源(NCST)任务,通过构建新数据集和基准测试实现开放集分类与可解释检测。
English: Current audio deepfake detection is shifting from binary to multi-class source tracing, yet existing methods only address closed-set scenarios, prompting this study to define the Neural Codec Source Tracing (NCST) task for open-set classification and interpretable detection using a newly constructed dataset and benchmark.

Authors:Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Xuan Shen, Pu Zhao, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang
Title: RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation
Abstract:
Fine-tuning helps large language models (LLM) recover degraded information and enhance task performance. Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases. To address this issue, we propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective method for optimizing LoRA's scaling factor. By replacing $α/r$ with $α/\sqrt{r}$, RoRA ensures improved performance as rank size increases. Moreover, RoRA enhances low-rank adaptation in fine-tuning uncompressed models and excels in the more challenging task of accuracy recovery when fine-tuning pruned models. Extensive experiments demonstrate the effectiveness of RoRA in fine-tuning both uncompressed and pruned models. RoRA surpasses the state-of-the-art (SOTA) in average accuracy and robustness on LLaMA-7B/13B, LLaMA2-7B, and LLaMA3-8B, specifically outperforming LoRA and DoRA by 6.5% and 2.9% on LLaMA-7B, respectively. In pruned model fine-tuning, RoRA shows significant advantages; for SHEARED-LLAMA-1.3, a LLaMA-7B with 81.4% pruning, RoRA achieves 5.7% higher average accuracy than LoRA and 3.9% higher than DoRA.
中文: RoRA通过将α/r替换为α/√r来优化LoRA的缩放因子,在秩大小增加时提升性能,并在微调未压缩和剪枝模型时实现了最先进的准确率。
English: RoRA optimizes LoRA's scaling factor by replacing α/r with α/√r, enhancing performance with increasing rank size and achieving state-of-the-art accuracy in fine-tuning both uncompressed and pruned models.
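
The change of scaling factor described in the abstract can be made concrete with a small numerical sketch. The function names below are hypothetical, and the value alpha = 16 is an arbitrary illustration rather than a setting reported by the authors.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Conventional LoRA scaling factor alpha / r."""
    return alpha / rank


def rora_scale(alpha: float, rank: int) -> float:
    """Scaling factor alpha / sqrt(r), as described in the RoRA abstract."""
    return alpha / math.sqrt(rank)


if __name__ == "__main__":
    alpha = 16.0  # illustrative value, not taken from the paper
    for rank in (4, 16, 64, 256):
        print(rank, lora_scale(alpha, rank), rora_scale(alpha, rank))
```

With alpha fixed, alpha / r shrinks linearly in the rank while alpha / sqrt(r) shrinks only with its square root, so the low-rank update is damped less aggressively at large ranks.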

Authors:Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, Feng Zheng, Liang He
Title: Enhancing Uncertainty Modeling with Semantic Graph for Hallucination Detection
Abstract:
Large Language Models (LLMs) are prone to hallucination with non-factual or unfaithful statements, which undermines their application in real-world scenarios. Recent research focuses on uncertainty-based hallucination detection, which utilizes the output probability of LLMs for uncertainty calculation and does not rely on external knowledge or frequent sampling from LLMs. However, most approaches merely consider the uncertainty of each independent token, while the intricate semantic relations among tokens and sentences are not well studied, which limits the detection of hallucinations that span multiple tokens and sentences in the passage. In this paper, we propose a method to enhance uncertainty modeling with a semantic graph for hallucination detection. Specifically, we first construct a semantic graph that captures the relations among entity tokens and sentences. Then, we incorporate the relations between two entities for uncertainty propagation to enhance sentence-level hallucination detection. Given that hallucination occurs due to the conflict between sentences, we further present a graph-based uncertainty calibration method that integrates the contradiction probability of the sentence with its neighbors in the semantic graph for uncertainty calculation. Extensive experiments on two datasets show the great advantages of our proposed approach. In particular, we obtain a substantial improvement of 19.78% in passage-level hallucination detection.
中文摘要:本文提出一种方法,通过构建语义图捕捉实体标记和句子间的关系,结合不确定性传播与校准来增强大型语言模型的幻觉检测,显著提升了段落级检测的性能。
English Summary: This paper introduces a method that enhances hallucination detection in Large Language Models by constructing a semantic graph to capture token and sentence relations, incorporating uncertainty propagation and calibration, which significantly improves performance in passage-level detection.

Authors:Mian Muhammad Naeem Abid, Nancy Mehta, Zongwei Wu, Radu Timofte
Title: ContextFormer: Redefining Efficiency in Semantic Segmentation
Abstract:
Semantic segmentation assigns labels to pixels in images, a critical yet challenging task in computer vision. Convolutional methods, although capturing local dependencies well, struggle with long-range relationships. Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands, especially for high-resolution inputs. Most research optimizes the encoder architecture, leaving the bottleneck underexplored - a key area for enhancing performance and efficiency. We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation. The framework's efficiency is driven by three synergistic modules: the Token Pyramid Extraction Module (TPEM) for hierarchical multi-scale representation, the Transformer and Branched DepthwiseConv (Trans-BDC) block for dynamic scale-aware feature modeling, and the Feature Merging Module (FMM) for robust integration with enhanced spatial and contextual consistency. Extensive experiments on ADE20K, Pascal Context, CityScapes, and COCO-Stuff datasets show ContextFormer significantly outperforms existing models, achieving state-of-the-art mIoU scores, setting a new benchmark for efficiency and performance. The codes will be made publicly available upon acceptance.
中文: ContextFormer是一种混合框架,在瓶颈部分结合了CNN和视觉Transformer的优势,以提升实时语义分割的效率和准确性,在多个数据集上实现了最先进的性能。
English: ContextFormer is a hybrid framework that combines CNNs and Vision Transformers in the bottleneck to enhance efficiency and accuracy for real-time semantic segmentation, achieving state-of-the-art results across multiple datasets.

Authors:Chen Chen, Yuchen Hu, Siyin Wang, Helin Wang, Zhehuai Chen, Chao Zhang, Chao-Han Huck Yang, Eng Siong Chng
Title: Audio Large Language Models Can Be Descriptive Speech Quality Evaluators
Abstract:
An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. In addition to the overall Mean Opinion Score (MOS), this corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. It also enables descriptive comparisons between two speech samples (A/B tests) with human-like judgment. Leveraging this corpus, we propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech and generating meaningful responses. Experimental results demonstrate that ALLD outperforms the previous state-of-the-art regression model in MOS prediction, with a mean square error of 0.17 and an A/B test accuracy of 98.6%. Additionally, the generated responses achieve BLEU scores of 25.8 and 30.2 on two tasks, surpassing the capabilities of task-specific models. This work advances the comprehensive perception of speech signals by audio LLMs, contributing to the development of real-world auditory and sensory intelligent agents.
中文: 本研究针对音频大语言模型缺乏语音质量感知的问题,首次引入基于自然语言的语音评估语料库,并提出一种结合LLM蒸馏的对齐方法,显著提升了语音质量评估和响应生成能力。
English: This study introduces the first natural language-based speech evaluation corpus to address the lack of quality awareness in audio large language models, proposing an alignment approach with LLM distillation that significantly improves speech quality assessment and response generation.

Authors:Zitong Li, Qingqing Ye, Haibo Hu
Title: FUNU: Boosting Machine Unlearning Efficiency by Filtering Unnecessary Unlearning
Abstract:
Machine unlearning is an emerging field that selectively removes specific data samples from a trained model. This capability is crucial for addressing privacy concerns, complying with data protection regulations, and correcting errors or biases introduced by certain data. Unlike traditional machine learning, where models are typically static once trained, machine unlearning facilitates dynamic updates that enable the model to "forget" information without requiring complete retraining from scratch. There are various machine unlearning methods, some of which are more time-efficient when data removal requests are fewer. To decrease the execution time of such machine unlearning methods, we aim to reduce the size of data removal requests based on the fundamental assumption that the removal of certain data would not result in a distinguishable retrained model. We first propose the concept of unnecessary unlearning, which indicates that the model would not alter noticeably after removing some data points. Subsequently, we review existing solutions that can be used to solve our problem. We highlight their limitations in adaptability to different unlearning scenarios and their reliance on manually selected parameters. We consequently put forward FUNU, a method to identify data points that lead to unnecessary unlearning. FUNU circumvents the limitations of existing solutions. The idea is to discover data points within the removal requests that have similar neighbors in the remaining dataset. We utilize a reference model to set parameters for finding neighbors, inspired by the area of model memorization. We provide a theoretical analysis of the privacy guarantee offered by FUNU and conduct extensive experiments to validate its efficacy.
中文摘要:本文提出FUNU方法,通过识别数据集中相似邻居来确定不必要的删除数据,从而减少机器遗忘的执行时间,同时保持模型性能和隐私保护。
English Summary: The paper introduces FUNU, an efficient machine unlearning method that identifies unnecessary data removals by finding similar neighbors in the dataset, reducing execution time while maintaining model performance and privacy.

Authors:Nan Gao, Jia Li, Huaibo Huang, Ke Shang, Ran He
Title: InfoBFR: Real-World Blind Face Restoration via Information Bottleneck
Abstract:
Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of data degradation patterns. Current BFR methods produce restored results that still suffer from inherent neural degradations, which limit real-world generalization in complicated scenarios. In this paper, we propose a plug-and-play framework InfoBFR to tackle neural degradations, e.g., prior bias, topological distortion, textural distortion, and artifact residues, which achieves high-generalization face restoration in diverse wild and heterogeneous scenes. Specifically, based on the results from pre-trained BFR models, InfoBFR considers information compression using manifold information bottleneck (MIB) and information compensation with efficient diffusion LoRA to conduct information optimization. InfoBFR effectively synthesizes high-fidelity faces without attribute and identity distortions. Comprehensive experimental results demonstrate the superiority of InfoBFR over state-of-the-art GAN-based and diffusion-based BFR methods, with around 70ms consumption, 16M trainable parameters, and nearly 85% BFR-boosting. It is promising that InfoBFR will be the first plug-and-play restorer universally employed by diverse BFR models to conquer neural degradations.
中文:本文提出InfoBFR这一即插即用框架,通过信息优化解决盲人脸修复中的神经退化问题,能以较低计算成本实现高保真效果,并在多样场景中具备广泛适用性。
English: This paper introduces InfoBFR, a plug-and-play framework that addresses neural degradations in blind face restoration through information optimization, achieving high-fidelity results with minimal computational cost and broad applicability across diverse scenarios.

Authors:Xunxin Cai, Chengrui Wang, Qingqing Long, Yuanchun Zhou, Meng Xiao
Title: Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training
Abstract:
The rapid advancement of large language models (LLMs) in biological-medical applications has highlighted a gap between their potential and the limited scale and often low quality of available open-source annotated textual datasets. In addition, the inherent complexity of the biomedical knowledge hierarchy significantly hampers efforts to bridge this gap. Can LLMs themselves play a pivotal role in overcoming this limitation? Motivated by this question, we investigate this challenge in the present study. We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain, guided by the biomedical knowledge hierarchy through medical subject headings (MeSH). This comprehensive framework establishes an automated workflow, thereby eliminating the need for manual intervention. Furthermore, we conducted comprehensive experiments to evaluate the impact of our framework-generated data on downstream language models of varying sizes. Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain and powerful closed-source models represented by GPT-4. Notably, the generated AI-Ready dataset enabled the Llama3-70B base model to outperform GPT-4 using MedPrompt with multiple times the number of parameters. Detailed case studies and ablation experiments underscore the significance of each component within our framework.
中文: 本研究提出一种自动化框架,利用大语言模型从科学文献中生成高质量生物医学训练数据,显著提升问答任务性能,并使Llama3-70B等模型超越GPT-4。
English: This study introduces an automated framework that leverages large language models to generate high-quality biomedical training data from scientific literature, significantly enhancing question-answering performance and enabling models like Llama3-70B to surpass GPT-4.

Authors:Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, Renjing Xu
Title: Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Abstract:
Large Language Models (LLMs) demonstrate impressive zero-shot performance across a wide range of natural language processing tasks. Integrating various modality encoders further expands their capabilities, giving rise to Multimodal Large Language Models (MLLMs) that process not only text but also visual and auditory modality inputs. However, these advanced capabilities may also pose significant security risks, as models can be exploited to generate harmful or inappropriate content through jailbreak attacks. While prior work has extensively explored how manipulating textual or visual modality inputs can circumvent safeguards in LLMs and MLLMs, the vulnerability of Large Audio-Language Models (LALMs) to audio-specific jailbreaks remains largely underexplored. To address this gap, we introduce Jailbreak-AudioBench, which consists of the Toolbox, curated Dataset, and comprehensive Benchmark. The Toolbox supports not only text-to-audio conversion but also various editing techniques for injecting audio hidden semantics. The curated Dataset provides diverse explicit and implicit jailbreak audio examples in both original and edited forms. Utilizing this dataset, we evaluate multiple state-of-the-art LALMs and establish the most comprehensive Jailbreak benchmark to date for audio modality. Finally, Jailbreak-AudioBench establishes a foundation for advancing future research on LALM safety alignment by enabling the in-depth exposure of more powerful jailbreak threats, such as query-based audio editing, and by facilitating the development of effective defense mechanisms.
中文摘要:大型语言模型及其多模态扩展面临越狱攻击的安全风险,为此我们开发了Jailbreak-AudioBench,通过集成工具集、精选数据集和全面基准测试,专门评估和应对大型音频语言模型中音频模态的越狱威胁。
English Summary: Large Language Models (LLMs) and their multimodal extensions face security vulnerabilities from jailbreak attacks, prompting the creation of Jailbreak-AudioBench to assess and address audio-specific threats in Large Audio-Language Models (LALMs) through a comprehensive toolkit, dataset, and benchmark.

Authors:Ruisi Zhao, Zechuan Zhang, Zongxin Yang, Yi Yang
Title: 3D Object Manipulation in a Single Image using Generative Models
Abstract:
Object manipulation in images aims to not only edit the object's presentation but also endow objects with motion. Previous methods encountered challenges in concurrently handling static editing and dynamic generation, while also struggling to achieve fidelity in object appearance and scene lighting. In this work, we introduce OMG3D, a novel framework that integrates precise geometric control with the generative power of diffusion models, thus achieving significant enhancements in visual performance. Our framework first converts 2D objects into 3D, enabling user-directed modifications and lifelike motions at the geometric level. To address texture realism, we propose CustomRefiner, a texture refinement module that pre-trains a customized diffusion model, aligning the details and style of coarse renderings of the rough 3D model with the original image and further refining the texture. Additionally, we introduce IllumiCombiner, a lighting processing module that estimates and corrects background lighting to match human visual perception, resulting in more realistic shadow effects. Extensive experiments demonstrate the outstanding visual performance of our approach in both static and dynamic scenarios. Remarkably, all these steps can be done using one NVIDIA 3090. Project page is at https://whalesong-zrs.github.io/OMG3D-projectpage/
中文: OMG3D框架通过将3D几何控制与扩散模型相结合,实现了图像中物体的精确编辑和生动运动,其定制优化器和光照合成器模块分别提升了纹理真实感与光影效果,在静态和动态场景中均展现出卓越的视觉表现。
English: The OMG3D framework enhances image object manipulation by integrating 3D geometric control with diffusion models, enabling precise editing and realistic motion, while its CustomRefiner and IllumiCombiner modules improve texture and lighting for superior visual fidelity in both static and dynamic scenarios.

Authors:Nikos Kanakaris, Heng Ping, Xiongye Xiao, Nesreen K. Ahmed, Luca Luceri, Emilio Ferrara, Paul Bogdan
Title: Network-informed Prompt Engineering against Organized Astroturf Campaigns under Extreme Class Imbalance
Abstract:
Detecting organized political campaigns is of paramount importance in fighting against disinformation on social media. Existing approaches for the identification of such organized actions employ techniques mostly from network science, graph machine learning and natural language processing. Their ultimate goal is to analyze the relationships and interactions (e.g. re-posting) among users and the textual similarities of their posts. Despite their effectiveness in recognizing astroturf campaigns, these methods face significant challenges, notably the class imbalance in available training datasets. To mitigate this issue, recent methods usually resort to data augmentation or increasing the number of positive samples, which may not always be feasible or sufficient in real-world settings. Following a different path, in this paper, we propose a novel framework for identifying astroturf campaigns based solely on large language models (LLMs), introducing a Balanced Retrieval-Augmented Generation (Balanced RAG) component. Our approach first gives both textual information concerning the posts (in our case tweets) and the user interactions of the social network as input to a language model. Then, through prompt engineering and the proposed Balanced RAG method, it effectively detects coordinated disinformation campaigns on X (Twitter). The proposed framework does not require any training or fine-tuning of the language model. Instead, by strategically harnessing the strengths of prompt engineering and Balanced RAG, it helps LLMs overcome the effects of class imbalance and effectively identify coordinated political campaigns. The experimental results demonstrate that by incorporating the proposed prompt engineering and Balanced RAG methods, our framework outperforms the traditional graph-based baselines, achieving 2x-3x improvements in terms of precision, recall and F1 scores.
中文摘要:本文提出了一种基于大型语言模型的新型框架,通过平衡检索增强生成技术有效识别社交媒体上的协同政治虚假宣传活动,无需模型训练即可显著超越传统基于图的方法。
English Summary: This paper introduces a novel framework using large language models (LLMs) with Balanced Retrieval-Augmented Generation to effectively detect coordinated political disinformation campaigns on social media without requiring model training, significantly outperforming traditional graph-based methods.

Authors:Matteo Zecchin, Fredrik Hellström, Sangwoo Park, Shlomo Shamai, Osvaldo Simeone
Title: Generalization and Informativeness of Weighted Conformal Risk Control Under Covariate Shift
Abstract:
Predictive models are often required to produce reliable predictions under statistical conditions that are not matched to the training data. A common type of training-testing mismatch is covariate shift, where the conditional distribution of the target variable given the input features remains fixed, while the marginal distribution of the inputs changes. Weighted conformal risk control (W-CRC) uses data collected during the training phase to convert point predictions into prediction sets with valid risk guarantees at test time despite the presence of a covariate shift. However, while W-CRC provides statistical reliability, its efficiency -- measured by the size of the prediction sets -- can only be assessed at test time. In this work, we relate the generalization properties of the base predictor to the efficiency of W-CRC under covariate shifts. Specifically, we derive a bound on the inefficiency of the W-CRC predictor that depends on algorithmic hyperparameters and task-specific quantities available at training time. This bound offers insights on relationships between the informativeness of the prediction sets, the extent of the covariate shift, and the size of the calibration and training sets. Experiments on fingerprinting-based localization validate the theoretical results.
中文:加权保形风险控制(W-CRC)利用训练数据在协变量偏移下确保可靠的预测集,而本研究通过建立其效率下界,将泛化特性、偏移程度与数据集规模相关联,为实践提供理论依据。
English: Weighted conformal risk control (W-CRC) ensures reliable prediction sets under covariate shift by leveraging training data, while this work establishes a bound on its inefficiency that links generalization properties, shift extent, and dataset sizes for practical insights.
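As a rough illustration of how W-CRC can pick the parameter that controls prediction-set size, the sketch below follows the commonly stated weighted form of conformal risk control; the abstract itself gives no formula. It assumes that the weights are likelihood ratios between test and training input distributions, that losses are bounded by `loss_bound`, and that each loss is non-increasing along the candidate grid `lambdas`. The function name and all arguments are hypothetical.

```python
from typing import List, Sequence


def wcrc_threshold(
    cal_losses: Sequence[Sequence[float]],
    cal_weights: Sequence[float],
    test_weight: float,
    lambdas: List[float],
    alpha: float,
    loss_bound: float = 1.0,
) -> float:
    """Pick the first lambda whose weighted risk estimate is at most alpha.

    cal_losses[i][j] is the loss of calibration point i at lambdas[j]
    (assumed non-increasing in j). The test point is counted at its
    worst-case loss, loss_bound, which keeps the estimate conservative.
    """
    total = sum(cal_weights) + test_weight
    for j, lam in enumerate(lambdas):
        weighted = sum(w * losses[j] for w, losses in zip(cal_weights, cal_losses))
        if (weighted + test_weight * loss_bound) / total <= alpha:
            return lam
    # No candidate meets the target: fall back to the most conservative value.
    return lambdas[-1]
```

A larger covariate shift concentrates the weights on a few calibration points, which typically pushes the selected parameter up and makes the prediction sets larger; this is the kind of inefficiency the paper's bound is concerned with.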

Authors:Lipeng Ma, Weidong Yang, Yixuan Li, Ben Fei, Mingjie Zhou, Shuhao Li, Sihang Jiang, Bo Xu, Yanghua Xiao
Title: AdaptiveLog: An Adaptive Log Analysis Framework with the Collaboration of Large and Small Language Model
Abstract:
Automated log analysis is crucial to ensure high availability and reliability of complex systems. The advent of LLMs in NLP has ushered in a new era of language model-driven automated log analysis, garnering significant interest. Within this field, two primary paradigms based on language models for log analysis have become prominent. Small Language Models (SLMs) follow the pre-train and fine-tune paradigm, focusing on the specific log analysis task through fine-tuning on supervised datasets. On the other hand, LLMs follow the in-context learning paradigm, analyzing logs from a few examples provided in the prompt context without updating parameters. Despite their respective strengths, we notice that SLMs are more cost-effective but less powerful, whereas LLMs with large parameter counts are highly powerful but expensive and inefficient. To balance performance against inference cost in automated log analysis, this paper introduces an adaptive log analysis framework known as AdaptiveLog, which effectively reduces the costs associated with the LLM while ensuring superior results. This framework combines an LLM and a small language model, strategically allocating the LLM to tackle complex logs while delegating simpler logs to the SLM. Specifically, to efficiently query the LLM, we propose an adaptive selection strategy based on the uncertainty estimation of the SLM, where the LLM is invoked only when the SLM is uncertain. In addition, to enhance the reasoning ability of the LLM in log analysis tasks, we propose a novel prompt strategy by retrieving similar error-prone cases as the reference, enabling the model to leverage past error experiences and learn solutions from these cases. Extensive experiments demonstrate that AdaptiveLog achieves state-of-the-art results across different tasks, elevating the overall accuracy of log analysis while maintaining cost efficiency.
中文: 本文提出AdaptiveLog框架,通过结合大型和小型语言模型,将复杂日志分析任务分配给LLM,简单任务分配给SLM,在提升分析准确性的同时有效降低了成本。
English: This paper introduces AdaptiveLog, a framework that combines large and small language models to optimize automated log analysis by assigning complex tasks to LLMs and simpler ones to SLMs, thereby enhancing accuracy while reducing costs.
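The routing idea in this entry (call the LLM only when the SLM is uncertain) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how uncertainty is estimated, so predictive entropy and the `max_entropy` threshold are assumptions, and `slm_predict` and `llm_predict` are hypothetical callables supplied by the user.

```python
import math
from typing import Callable, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(
    log_line: str,
    slm_predict: Callable[[str], Sequence[float]],
    llm_predict: Callable[[str], int],
    max_entropy: float = 0.5,
) -> Tuple[int, str]:
    """Return (label, source); the LLM is queried only if the SLM is uncertain."""
    probs = slm_predict(log_line)
    if entropy(probs) <= max_entropy:
        # Confident SLM: take its most likely class and skip the LLM call.
        label = max(range(len(probs)), key=probs.__getitem__)
        return label, "slm"
    # Uncertain SLM: delegate this log line to the more expensive LLM.
    return llm_predict(log_line), "llm"
```

Raising `max_entropy` sends fewer logs to the LLM and lowers cost; lowering it does the opposite, which is the performance versus cost trade-off the abstract describes.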

Authors:Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, Sergey Levine
Title: FAST: Efficient Action Tokenization for Vision-Language-Action Models
Abstract:
Autoregressive sequence models, such as Transformer-based vision-language action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches for robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
中文:自回归序列模型(如基于Transformer的视觉语言动作策略)在捕捉复杂机器人行为方面非常有效,但面临动作标记化的挑战,而提出的FAST+方法通过离散余弦变换实现了高效压缩和训练,解决了这一问题。
English: Autoregressive sequence models such as Transformer-based vision-language-action policies are effective for robotic behaviors but struggle with action tokenization for dexterous, high-frequency tasks; the proposed FAST scheme addresses this with compression based on the discrete cosine transform, and FAST+ is released as a universal tokenizer trained on 1M real robot action trajectories.
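To make the compression idea concrete, here is a small sketch of tokenizing one action dimension with a discrete cosine transform followed by rounding. The abstract names only the DCT; the rounding step, the `scale` parameter, and the function names are assumptions for illustration, and the released FAST tokenizer may differ in its details.

```python
import math
from typing import List


def dct(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct(c: List[float]) -> List[float]:
    """Inverse of dct(): the transpose of the orthonormal DCT-II matrix."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def tokenize(actions: List[float], scale: float = 10.0) -> List[int]:
    """Map a chunk of continuous actions to integer frequency-space symbols."""
    return [round(scale * c) for c in dct(actions)]


def detokenize(tokens: List[int], scale: float = 10.0) -> List[float]:
    """Approximately recover the continuous actions from the integer symbols."""
    return idct([t / scale for t in tokens])
```

For smooth, high-frequency trajectories most of the energy sits in the low-frequency coefficients, so many of the rounded symbols tend to be zero. Per-timestep binning has no comparable way to exploit that redundancy.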

Authors:Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao, Yuki Mitsufuji
Title: Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabling users to fine-tune models with limited computational resources. However, the approximation gap between the low-rank assumption and desired fine-tuning weights prevents the simultaneous acquisition of ultra-parameter-efficiency and better performance. To reduce this gap and further improve the power of LoRA, we propose a new PEFT method that combines two classes of adaptations, namely, transform and residual adaptations. Specifically, we first apply a full-rank and dense transform to the pre-trained weight. This learnable transform is expected to align the pre-trained weight as closely as possible to the desired weight, thereby reducing the rank of the residual weight. Then, the residual part can be effectively approximated by more compact and parameter-efficient structures, with a smaller approximation error. To achieve ultra-parameter-efficiency in practice, we design highly flexible and effective tensor decompositions for both the transform and residual adaptations. Additionally, popular PEFT methods such as DoRA can be summarized under this transform plus residual adaptation scheme. Experiments are conducted on fine-tuning Stable Diffusion models in subject-driven and controllable generation. The results show that our method can achieve better performance and parameter efficiency compared to LoRA and several baselines.
中文: 本文提出了一种新颖的参数高效微调方法,通过结合变换和残差适应来减小LoRA中的近似差距,在文本到图像模型微调中实现了更优的性能和参数效率。
English: This paper introduces a novel Parameter-Efficient Fine-Tuning method that combines transform and residual adaptations to reduce the approximation gap in LoRA, achieving superior performance and parameter efficiency in text-to-image model fine-tuning.

Authors:Chenguang Liu, Yongchao Feng, Yanan Zhang, Qingjie Liu, Yunhong Wang
Title: PACF: Prototype Augmented Compact Features for Improving Domain Adaptive Object Detection
Abstract:
In recent years, there has been significant advancement in object detection. However, applying off-the-shelf detectors to a new domain leads to a significant performance drop, caused by the domain gap. These detectors exhibit higher-variance class-conditional distributions in the target domain than those in the source domain, along with mean shift. To address this problem, we propose the Prototype Augmented Compact Features (PACF) framework to regularize the distribution of intra-class features. Specifically, we provide an in-depth theoretical analysis on the lower bound of the target features-related likelihood and derive the prototype cross entropy loss to further calibrate the distribution of target RoI features. Furthermore, a mutual regularization strategy is designed to enable the linear and prototype-based classifiers to learn from each other, promoting feature compactness while enhancing discriminability. Thanks to this PACF framework, we have obtained a more compact cross-domain feature space, within which the variance of the target features' class-conditional distributions has significantly decreased, and the class-mean shift between the two domains has also been further reduced. The results on different adaptation settings are state-of-the-art, which demonstrates the broad applicability and effectiveness of the proposed approach.
中文: 提出的原型增强紧凑特征(PACF)框架通过规范化类内特征分布,有效减少目标检测中的领域差距,并借助原型分类器和相互正则化策略实现了最先进的性能。
English: The proposed Prototype Augmented Compact Features (PACF) framework effectively reduces domain gap in object detection by regularizing intra-class feature distributions, achieving state-of-the-art performance through prototype-based classifiers and mutual regularization.

Authors:Jiayi Huang, Sangwoo Park, Nicola Paoletti, Osvaldo Simeone
Title: Distilling Calibration via Conformalized Credal Inference
Abstract:
Deploying artificial intelligence (AI) models on edge devices involves a delicate balance between meeting stringent complexity constraints, such as limited memory and energy resources, and ensuring reliable performance in sensitive decision-making tasks. One way to enhance reliability is through uncertainty quantification via Bayesian inference. This approach, however, typically necessitates maintaining and running multiple models in an ensemble, which may exceed the computational limits of edge devices. This paper introduces a low-complexity methodology to address this challenge by distilling calibration information from a more complex model. In an offline phase, predictive probabilities generated by a high-complexity cloud-based model are leveraged to determine a threshold based on the typical divergence between the cloud and edge models. At run time, this threshold is used to construct credal sets -- ranges of predictive probabilities that are guaranteed, with a user-selected confidence level, to include the predictions of the cloud model. The credal sets are obtained through thresholding of a divergence measure in the simplex of predictive probabilities. Experiments on visual and language tasks demonstrate that the proposed approach, termed Conformalized Distillation for Credal Inference (CD-CI), significantly improves calibration performance compared to low-complexity Bayesian methods, such as Laplace approximation, making it a practical and efficient solution for edge AI deployments.
中文摘要:本文提出的CD-CI方法通过从云端模型提取校准信息,为边缘AI实现了可靠的置信推理,在严格遵循计算限制的同时,其校准性能显著优于传统贝叶斯方法。
English Summary: This paper introduces CD-CI, a low-complexity method that distills calibration information from cloud models to enable reliable uncertainty quantification for edge AI, significantly outperforming traditional Bayesian approaches while respecting computational constraints.
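The two phases described in this entry can be sketched as below: an offline step that turns cloud-versus-edge divergences into a threshold, and a run-time membership test for the resulting credal set. This is a sketch under stated assumptions rather than the paper's algorithm: the KL divergence and the standard split-conformal quantile rule are choices made here for illustration, and all function names are hypothetical.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL divergence D(p || q) between two discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def calibrate_threshold(
    cloud_probs: Sequence[Sequence[float]],
    edge_probs: Sequence[Sequence[float]],
    alpha: float,
) -> float:
    """Offline phase: conformal quantile of cloud-vs-edge divergences."""
    scores = sorted(kl(p, q) for p, q in zip(cloud_probs, edge_probs))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this confidence level.
        return math.inf
    return scores[rank - 1]


def in_credal_set(
    candidate: Sequence[float], edge_prob: Sequence[float], threshold: float
) -> bool:
    """Run time: is `candidate` within the divergence ball around the edge output?"""
    return kl(candidate, edge_prob) <= threshold
```

The credal set is every distribution in the simplex that passes `in_credal_set`; a smaller `alpha` raises the threshold and widens the set, trading informativeness for coverage of the cloud model's prediction.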

Authors:Jiale Zhang, Bosen Rao, Chengcheng Zhu, Xiaobing Sun, Qingming Li, Haibo Hu, Xiapu Luo, Qingqing Ye, Shouling Ji
Title: Fine-tuning is Not Fine: Mitigating Backdoor Attacks in GNNs with Limited Clean Data
Abstract:
Graph Neural Networks (GNNs) have achieved remarkable performance through their message-passing mechanism. However, recent studies have highlighted the vulnerability of GNNs to backdoor attacks, which can lead the model to misclassify graphs with attached triggers as the target class. The effectiveness of recent promising defense techniques, such as fine-tuning or distillation, is heavily contingent on having comprehensive knowledge of the sufficient training dataset. Empirical studies have shown that fine-tuning methods require a clean dataset of 20% to reduce attack accuracy to below 25%, while distillation methods require a clean dataset of 15%. However, obtaining such a large amount of clean data is commonly impractical. In this paper, we propose a practical backdoor mitigation framework, denoted as GRAPHNAD, which can capture high-quality intermediate-layer representations in GNNs to enhance the distillation process with limited clean data. To achieve this, we address the following key questions: How to identify the appropriate attention representations in graphs for distillation? How to enhance distillation with limited data? By adopting the graph attention transfer method, GRAPHNAD can effectively align the intermediate-layer attention representations of the backdoored model with those of the teacher model, forcing the backdoor neurons to transform into benign ones. In addition, we extract the relation maps from intermediate-layer transformation and enforce the relation maps of the backdoored model to be consistent with those of the teacher model, thereby ensuring model accuracy while further reducing the influence of backdoors. Extensive experimental results show that by fine-tuning a teacher model with only 3% of the clean data, GRAPHNAD can reduce the attack success rate to below 5%.
Chinese: GRAPHNAD是一种实用的后门缓解框架,通过对齐中间层注意力表示和关系图,在有限干净数据下增强图神经网络的蒸馏过程,仅需3%干净数据即可将攻击成功率降至5%以下。
English: GRAPHNAD is a practical backdoor mitigation framework that enhances distillation in Graph Neural Networks using limited clean data by aligning intermediate-layer attention representations and relation maps, reducing attack success rates to below 5% with only 3% clean data.

Authors:Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
Title: 3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
Abstract:
The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
中文:3DIS-FLUX框架通过集成FLUX模型扩展了原有3DIS方法,将多实例生成解耦为场景构建与细节渲染两个阶段,在性能和图像质量上均超越了当前基于适配器的先进方法。
English: The 3DIS-FLUX framework extends the original 3DIS method by integrating the FLUX model for enhanced multi-instance generation, achieving superior performance and image quality compared to existing adapter-based approaches by decoupling scene construction and detail rendering.

Authors:Meng Xiao, Weiliang Zhang, Xiaohan Huang, Hengshu Zhu, Min Wu, Xiaoli Li, Yuanchun Zhou
Title: Knowledge-Guided Biomarker Identification for Label-Free Single-Cell RNA-Seq Data: A Reinforcement Learning Perspective
Abstract:
Gene panel selection aims to identify the most informative genomic biomarkers in label-free genomic datasets. Traditional approaches, which rely on domain expertise, embedded machine learning models, or heuristic-based iterative optimization, often introduce biases and inefficiencies, potentially obscuring critical biological signals. To address these challenges, we present an iterative gene panel selection strategy that harnesses ensemble knowledge from existing gene selection algorithms to establish preliminary boundaries or prior knowledge, which guide the initial search space. Subsequently, we incorporate reinforcement learning through a reward function shaped by expert behavior, enabling dynamic refinement and targeted selection of gene panels. This integration mitigates biases stemming from initial boundaries while capitalizing on RL's stochastic adaptability. Comprehensive comparative experiments, case studies, and downstream analyses demonstrate the effectiveness of our method, highlighting its improved precision and efficiency for label-free biomarker discovery. Our results underscore the potential of this approach to advance single-cell genomics data analysis.
Chinese: 本研究提出一种迭代基因面板选择方法,通过整合现有算法的集成知识与基于专家行为的强化学习,有效降低偏差并提升无标签数据集中基因组生物标志物识别的精确度。
English: This study introduces an iterative gene panel selection method that combines ensemble knowledge from existing algorithms with reinforcement learning guided by expert behavior, effectively reducing biases and enhancing precision in identifying genomic biomarkers from label-free datasets.

Authors:Song Wang, Xiaodong Yang, Rashidul Islam, Huiyuan Chen, Minghua Xu, Jundong Li, Yiwei Cai
Title: Enhancing Distribution and Label Consistency for Graph Out-of-Distribution Generalization
Abstract:
To deal with distribution shifts in graph data, various graph out-of-distribution (OOD) generalization techniques have been recently proposed. These methods often employ a two-step strategy that first creates augmented environments and subsequently identifies invariant subgraphs to improve generalizability. Nevertheless, this approach could be suboptimal from the perspective of consistency. First, the process of augmenting environments by altering the graphs while preserving labels may lead to graphs that are not realistic or meaningfully related to the original distribution, thus lacking distribution consistency. Second, the extracted subgraphs are obtained from directly modifying graphs, and may not necessarily maintain a consistent predictive relationship with their labels, thereby impacting label consistency. In response to these challenges, we introduce an innovative approach that aims to enhance these two types of consistency for graph OOD generalization. We propose a modifier to obtain both augmented and invariant graphs in a unified manner. With the augmented graphs, we enrich the training data without compromising the integrity of label-graph relationships. The label consistency enhancement in our framework further preserves the supervision information in the invariant graph. We conduct extensive experiments on real-world datasets to demonstrate the superiority of our framework over other state-of-the-art baselines.
中文: 本文提出了一种创新方法,通过统一图增强和不变子图提取来提升分布与标签一致性,从而改进图数据的分布外泛化能力,并在真实数据集上通过实验验证了其优越性。
English: This paper introduces a novel approach to enhance graph out-of-distribution generalization by improving distribution and label consistency through unified graph augmentation and invariant subgraph extraction, validated by superior experimental results on real-world datasets.

Authors:Sen Zhang, Qingqing Ye, Haibo Hu
Title: Structure-Preference Enabled Graph Embedding Generation under Differential Privacy
Abstract:
Graph embedding generation techniques aim to learn low-dimensional vectors for each node in a graph and have recently gained increasing research attention. Publishing low-dimensional node vectors enables various graph analysis tasks, such as structural equivalence and link prediction. Yet, improper publication opens a backdoor to malicious attackers, who can infer sensitive information of individuals from the low-dimensional node vectors. Existing methods tackle this issue by developing deep graph learning models with differential privacy (DP). However, they often suffer from large noise injections and cannot provide structural preferences consistent with mining objectives. Recently, skip-gram based graph embedding generation techniques have been widely used due to their ability to extract customizable structures. Based on skip-gram, we present SE-PrivGEmb, a structure-preference enabled graph embedding generation method under DP. For arbitrary structure preferences, we design a unified noise tolerance mechanism via perturbing non-zero vectors. This mechanism mitigates utility degradation caused by high sensitivity. By carefully designing negative sampling probabilities in skip-gram, we theoretically demonstrate that skip-gram can preserve arbitrary proximities, which quantify structural features in graphs. Extensive experiments show that our method outperforms existing state-of-the-art methods under structural equivalence and link prediction tasks.
中文: SE-PrivGEmb方法通过扰动非零向量和优化负采样生成具备差分隐私的图嵌入,在保持结构特征的同时,在结构等价性和链接预测任务上优于现有方法。
English: The proposed SE-PrivGEmb method generates differentially private graph embeddings by perturbing non-zero vectors and optimizing negative sampling, effectively preserving structural features while outperforming existing approaches in tasks like structural equivalence and link prediction.

Authors:Tianyang Wang, Yunze Wang, Jun Zhou, Benji Peng, Xinyuan Song, Charles Zhang, Xintian Sun, Qian Niu, Junyu Liu, Silin Chen, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Ming Liu, Yichao Zhang, Cheng Fei, Caitlyn Heqi Yin, Lawrence KQ Yan
Title: From Aleatoric to Epistemic: Exploring Uncertainty Quantification Techniques in Artificial Intelligence
Abstract:
Uncertainty quantification (UQ) is a critical aspect of artificial intelligence (AI) systems, particularly in high-risk domains such as healthcare, autonomous systems, and financial technology, where decision-making processes must account for uncertainty. This review explores the evolution of uncertainty quantification techniques in AI, distinguishing between aleatoric and epistemic uncertainties, and discusses the mathematical foundations and methods used to quantify these uncertainties. We provide an overview of advanced techniques, including probabilistic methods, ensemble learning, sampling-based approaches, and generative models, while also highlighting hybrid approaches that integrate domain-specific knowledge. Furthermore, we examine the diverse applications of UQ across various fields, emphasizing its impact on decision-making, predictive accuracy, and system robustness. The review also addresses key challenges such as scalability, efficiency, and integration with explainable AI, and outlines future directions for research in this rapidly developing area. Through this comprehensive survey, we aim to provide a deeper understanding of UQ's role in enhancing the reliability, safety, and trustworthiness of AI systems.
Chinese: 本综述探讨了人工智能中不确定性量化技术的演进,涵盖其数学基础、概率方法和集成学习等多种技术,以及在医疗和金融等高风险领域的应用,同时指出了可扩展性和可解释性等挑战及未来研究方向,旨在提升AI系统的可靠性与安全性。
English: This review examines uncertainty quantification techniques in AI, covering their mathematical foundations, methods like probabilistic modeling and ensemble learning, and applications in high-risk fields, while addressing challenges and future research directions to enhance AI reliability and safety.

Authors:Zaiyi Zheng, Yushun Dong, Song Wang, Haochen Liu, Qi Wang, Jundong Li
Title: KG-CF: Knowledge Graph Completion with Context Filtering under the Guidance of Large Language Models
Abstract:
Large Language Models (LLMs) have shown impressive performance in various tasks, including knowledge graph completion (KGC). However, current studies mostly apply LLMs to classification tasks, like identifying missing triplets, rather than ranking-based tasks, where the model ranks candidate entities based on plausibility. This focus limits the practical use of LLMs in KGC, as real-world applications prioritize highly plausible triplets. Additionally, while graph paths can help infer the existence of missing triplets and improve completion accuracy, they often contain redundant information. To address these issues, we propose KG-CF, a framework tailored for ranking-based KGC tasks. KG-CF leverages LLMs' reasoning abilities to filter out irrelevant contexts, achieving superior results on real-world datasets. The code and datasets are available at https://anonymous.4open.science/r/KG-CF.
中文: 提出的KG-CF框架通过利用大语言模型过滤无关信息,有效提升了基于排序的知识图谱补全任务在真实数据集上的表现。
English: The proposed KG-CF framework enhances ranking-based knowledge graph completion by using LLMs to filter irrelevant contexts, achieving superior performance on real-world datasets.

Authors:Wenhao Wang, Yifan Sun, Zongxin Yang, Zhentao Tan, Zhengdong Hu, Yi Yang
Title: Origin Identification for Text-Guided Image-to-Image Diffusion Models
Abstract:
Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for spreading misinformation, infringing on copyrights, and evading content tracing. This motivates us to introduce the task of origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to visual discrepancy across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, OriPID, contains abundant Origins and guided Prompts, which can be used to train and test potential IDentification models across various diffusion models. In the method section, we first prove the existence of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be generalized across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods ($+31.6\%$ mAP), even those with generalization designs. The project is available at https://id2icml.github.io.
中文摘要:本文针对文本引导图像生成模型的溯源问题,提出ID²任务及跨模型通用解决方案,通过线性变换对齐潜在特征,在构建的数据集上实现比相似度方法提升31.6%的检索性能。
English Summary: This paper introduces the ID² task for tracing original images from text-guided diffusion model outputs, proposing a dataset and a linear transformation method that significantly outperforms similarity-based approaches by ensuring cross-model generalizability.

Authors:Sen Zhang, Haibo Hu, Qingqing Ye, Jianliang Xu
Title: PrivDPR: Synthetic Graph Publishing with Deep PageRank under Differential Privacy
Abstract:
The objective of privacy-preserving synthetic graph publishing is to safeguard individuals' privacy while retaining the utility of original data. Most existing methods focus on graph neural networks under differential privacy (DP), and yet two fundamental problems in generating synthetic graphs remain open. First, the current research often encounters high sensitivity due to the intricate relationships between nodes in a graph. Second, DP is usually achieved through advanced composition mechanisms that tend to converge prematurely when working with a small privacy budget. In this paper, inspired by the simplicity, effectiveness, and ease of analysis of PageRank, we design PrivDPR, a novel privacy-preserving deep PageRank for graph synthesis. In particular, we achieve DP by adding noise to the gradient for a specific weight during learning. Utilizing weight normalization as a bridge, we theoretically reveal that increasing the number of layers in PrivDPR can effectively mitigate the high sensitivity and privacy budget splitting. Through formal privacy analysis, we prove that the synthetic graph generated by PrivDPR satisfies node-level DP. Experiments on real-world graph datasets show that PrivDPR preserves high data utility across multiple graph structural properties.
Chinese: 本文提出了一种基于深度PageRank的新型隐私保护图合成方法PrivDPR,通过向梯度添加噪声并利用权重归一化解决了高敏感性和隐私预算问题,在保持高数据效用的同时实现了节点级差分隐私。
English: This paper introduces PrivDPR, a novel deep PageRank-based method for privacy-preserving graph synthesis that addresses high sensitivity and privacy budget issues by adding noise to gradients and utilizing weight normalization, achieving node-level differential privacy while maintaining high data utility.

Authors:Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Title: Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment
Abstract:
Vision-language models (VLMs), such as CLIP, have demonstrated exceptional generalization capabilities and can quickly adapt to downstream tasks through prompt fine-tuning. Unfortunately, in classification tasks involving non-training classes, known as the open-vocabulary setting, fine-tuned VLMs often overfit to train classes, resulting in a misalignment between confidence scores and actual accuracy on unseen classes, which significantly undermines their reliability in real-world deployments. Existing confidence calibration methods typically require training parameters or analyzing features from the training dataset, restricting their ability to generalize to unseen classes without corresponding train data. Moreover, VLM-specific calibration methods rely solely on text features from train classes as calibration indicators, which inherently limits their ability to calibrate train classes. To address these challenges, we propose an effective multimodal calibration method, Contrast-Aware Calibration (CAC). Building on the original CLIP's zero-shot adaptability and the conclusion from empirical analysis that poor intra-class and inter-class discriminative ability on unseen classes is the root cause, we calculate calibration weights based on the contrastive difference between the original and fine-tuned CLIP. This method not only adapts to calibrating unseen classes but also overcomes the limitations of previous VLM calibration methods that could not calibrate train classes. In experiments involving 11 datasets with 5 fine-tuning methods, CAC consistently achieved the best calibration effect on both train and unseen classes without sacrificing accuracy and inference speed.
Chinese: 针对视觉语言模型在开放词汇分类中容易对训练类别过拟合的问题,本文提出的对比感知校准方法通过利用原始与微调模型间的对比差异,在11个数据集上实现了对训练和未见类别的最佳校准效果,且不损失精度或推理速度。
English: Vision-language models like CLIP often overfit to training classes in open-vocabulary settings, but the proposed Contrast-Aware Calibration method effectively addresses this by leveraging contrastive differences between original and fine-tuned models, achieving superior calibration across 11 datasets without compromising accuracy or speed.

Authors:Hao-Zhe Tan, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Title: Vision-Language Model Selection and Reuse for Downstream Adaptation
Abstract:
Pre-trained Vision-Language Models (VLMs) are becoming increasingly popular across various visual tasks, and several open-sourced VLM variants have been released. However, selecting the best-performing pre-trained VLM for a specific downstream task is challenging since no single VLM can achieve promising performance on all downstream tasks, and evaluating all available VLMs is impossible due to time and data limitations. To address this problem, this paper proposes a novel paradigm to select and reuse VLMs for downstream tasks, called Model Label Learning (MLL). The proposal contains three key modules: model labeling, which assigns labels to each VLM to describe their specialty and utility; model selection, which matches the requirements of the target task with model labels; and model reuse, which applies selected VLMs to the target task in an ensemble manner. The proposal is highly computationally efficient and growable since the model labeling process is completed independently of the target task and the ability could grow with the number of candidate VLMs. We also introduce a new benchmark for evaluating VLM selection methods, including 49 VLMs and 17 target task datasets. Experimental results clearly demonstrate the effectiveness of the proposed method for selecting and reusing VLMs.
Chinese: 本文提出模型标签学习(MLL)方法,通过标注预训练视觉语言模型的专长、匹配任务需求并以集成方式重用,高效地针对下游任务选择和复用模型,在包含49个模型和17个数据集的新基准测试中验证了其有效性。
English: This paper introduces Model Label Learning (MLL), a computationally efficient paradigm that selects and reuses pre-trained Vision-Language Models for specific downstream tasks by labeling their specialties, matching them to task requirements, and applying them in ensembles, as validated on a new benchmark with 49 VLMs and 17 datasets.

Authors:Wenqi Li, Yingli Chen, Keyang Zhou, Xiaoxiao Hu, Zilu Zheng, Yue Yan, Xinpeng Zhang, Wei Tang, Zhenxing Qian
Title: An Exceptional Dataset For Rare Pancreatic Tumor Segmentation
Abstract:
Pancreatic NEuroendocrine Tumors (pNETs) are very rare endocrine neoplasms that account for less than 5% of all pancreatic malignancies, with an incidence of only 1-1.5 cases per 100,000. Early detection of pNETs is critical for improving patient survival, but the rarity of pNETs makes segmenting them from CT a very challenging problem. So far, there has not been a dataset specifically for pNETs available to researchers. To address this issue, we propose a pNETs dataset, a well-annotated Contrast-Enhanced Computed Tomography (CECT) dataset focused exclusively on Pancreatic Neuroendocrine Tumors, containing data from 469 patients. This is the first dataset solely dedicated to pNETs, distinguishing it from previous collections. Additionally, we provide the baseline detection networks with a new slice-wise weight loss function designed for the UNet-based model, improving the overall pNET segmentation performance. We hope that our dataset can enhance the understanding and diagnosis of pNETs within the medical community, facilitate the development of more accurate diagnostic tools, and ultimately improve patient outcomes and advance the field of oncology.
中文: 本文推出了首个专门针对胰腺神经内分泌肿瘤的数据集,包含469例患者的增强CT扫描数据,并提出基于UNet的新型分割模型以提升诊疗水平,推动肿瘤学发展。
English: The authors introduce the first dedicated dataset for Pancreatic Neuroendocrine Tumors (pNETs), featuring CECT scans from 469 patients, together with baseline UNet-based networks trained with a new slice-wise weight loss function that improves pNET segmentation performance.

Authors:Long Peng, Xin Di, Zhanfeng Feng, Wenbo Li, Renjing Pei, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha
Title: Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration
Abstract:
Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.
中文: 该摘要提出TAMambaIR方法,通过纹理感知状态空间模型和多向感知模块,针对复杂纹理区域进行高效图像恢复,在超分辨率、去雨和低光增强任务中实现最优性能与计算效率的平衡。
English: The abstract introduces TAMambaIR, a texture-aware image restoration method that uses a novel state space model and multi-directional perception to efficiently enhance degraded areas with complex textures, achieving top performance in benchmarks while balancing quality and computational cost.

Authors:Subhadeep Koley, Viswanatha Reddy Gajjala, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song
Title: SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches
Abstract:
We introduce SketchYourSeg, a novel framework that establishes freehand sketches as a powerful query modality for subjective image segmentation across entire galleries through a single exemplar sketch. Unlike text prompts that struggle with spatial specificity or interactive methods confined to single-image operations, sketches naturally combine semantic intent with structural precision. This unique dual encoding enables precise visual disambiguation for segmentation tasks where text descriptions would be cumbersome or ambiguous -- such as distinguishing between visually similar instances, specifying exact part boundaries, or indicating spatial relationships in composed concepts. Our approach addresses three fundamental challenges: (i) eliminating the need for pixel-perfect annotation masks during training with a mask-free framework; (ii) creating a synergistic relationship between sketch-based image retrieval (SBIR) models and foundation models (CLIP/DINOv2) where the former provides training signals while the latter generates masks; and (iii) enabling multi-granular segmentation capabilities through purpose-made sketch augmentation strategies. Our extensive evaluations demonstrate superior performance over existing approaches across diverse benchmarks, establishing a new paradigm for user-guided image segmentation that balances precision with efficiency.
中文: SketchYourSeg 提出了一种创新框架,将手绘草图作为主观图像分割的查询方式,通过结合草图检索与基础模型,无需像素级标注即可实现精确的视觉区分,在多项基准测试中表现卓越。
English: SketchYourSeg introduces a novel framework that uses freehand sketches as queries for subjective image segmentation, enabling precise visual disambiguation without pixel-perfect training masks by synergizing sketch-based retrieval and foundation models for superior performance across benchmarks.

Authors:Yadong Li, Jun Liu, Tao Zhang, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, Chong Li, Yuanbo Fang, Dongdong Kuang, Mingrui Wang, Chenglin Zhu, Youwei Zhang, Hongyu Guo, Fengyu Zhang, Yuran Wang, Bowen Ding, Wei Song, Xu Li, Yuqi Huo, Zheng Liang, Shusen Zhang, Xin Wu, Shuai Zhao, Linchu Xiong, Yozhen Wu, Jiahui Ye, Wenhao Lu, Bowen Li, Yan Zhang, Yaqi Zhou, Xin Chen, Lei Su, Hongda Zhang, Fuzhong Chen, Xuezhen Dong, Na Nie, Zhiying Wu, Bin Xiao, Ting Li, Shunya Dang, Ping Zhang, Yijia Sun, Jincheng Wu, Jinjie Yang, Xionghai Lin, Zhi Ma, Kegeng Wu, Jia li, Aiyuan Yang, Hui Liu, Jianqiang Zhang, Xiaoxi Chen, Guangwei Ai, Wentao Zhang, Yicong Chen, Xiaoqin Huang, Kun Li, Wenjing Luo, Yifei Duan, Lingling Zhu, Ran Xiao, Zhe Su, Jiani Pu, Dian Wang, Xu Jia, Tianyu Zhang, Mengyu Ai, Mang Wang, Yujing Qiao, Lei Zhang, Yanjun Shen, Fan Yang, Miao Zhen, Yijie Zhou, Mingyang Chen, Fei Li, Chenzheng Zhu, Keer Lu, Yaqi Zhao, Hao Liang, Youquan Li, Yanzhao Qin, Linzhuang Sun, Jianhua Xu, Haoze Sun, Mingan Lin, Zenan Zhou, Weipeng Chen
Title: Baichuan-Omni-1.5 Technical Report
Abstract:
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal understanding capabilities but also provides end-to-end audio generation capabilities. To achieve fluent and high-quality interaction across modalities without compromising the capabilities of any modality, we prioritized optimizing three key aspects. First, we establish a comprehensive data cleaning and synthesis pipeline for multimodal data, obtaining about 500B high-quality data (text, audio, and vision). Second, an audio-tokenizer (Baichuan-Audio-Tokenizer) has been designed to capture both semantic and acoustic information from audio, enabling seamless integration and enhanced compatibility with MLLM. Lastly, we designed a multi-stage training strategy that progressively integrates multimodal alignment and multitask fine-tuning, ensuring effective synergy across all modalities. Baichuan-Omni-1.5 leads contemporary models (including GPT4o-mini and MiniCPM-o 2.6) in terms of comprehensive omni-modal capabilities. Notably, it achieves results comparable to leading models such as Qwen2-VL-72B across various multimodal medical benchmarks.
中文: Baichuan-Omni-1.5 是一款全模态模型,具备全模态理解与端到端音频生成能力,通过优化数据处理、音频分词和多阶段训练策略,在多模态基准测试中表现卓越。
English: Baichuan-Omni-1.5 is an omni-modal model with comprehensive multimodal understanding and end-to-end audio generation capabilities, excelling in multimodal benchmarks through optimized data processing, audio tokenization, and progressive training strategies.

Authors:Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
Title: Autonomy-of-Experts Models
Abstract:
Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models having 700M up to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
中文: 提出的专家自主(AoE)范式通过专家根据激活规模自主选择处理输入,消除了路由器的需求,不仅提高了专家选择的准确性和学习效率,还通过低秩分解保持了计算经济性。
English: The proposed Autonomy-of-Experts (AoE) paradigm eliminates routers by enabling experts to self-select based on their activation norms, improving selection accuracy and learning efficiency while maintaining computational economy through low-rank factorization.
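
The router-free selection step described above (experts pre-compute an internal activation, are ranked by its norm, and only the top-ranked ones continue) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `down_projs` is a hypothetical list of small per-expert matrices standing in for the first low-rank factor of each expert, and `select_experts` and `top_k` are illustrative names.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def select_experts(x, down_projs, top_k):
    """Rank experts by the L2 norm of their pre-computed activation.

    Returns the indices of the top_k experts (largest norm first) and
    their cached activations, so the chosen experts can continue the
    forward pass without recomputing this step.
    """
    acts = [matvec(W, x) for W in down_projs]
    norms = [math.sqrt(sum(a * a for a in act)) for act in acts]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, {i: acts[i] for i in chosen}


# Hypothetical toy token and three toy experts.
token = [1.0, 0.0]
experts = [
    [[0.1, 0.0], [0.0, 0.1]],  # activation [0.1, 0.0], norm 0.1
    [[2.0, 0.0], [0.0, 1.0]],  # activation [2.0, 0.0], norm 2.0
    [[1.0, 0.0], [1.0, 0.0]],  # activation [1.0, 1.0], norm about 1.41
]
chosen, cached = select_experts(token, experts, top_k=2)
```

Here experts 1 and 2 are kept and expert 0 aborts; returning the cached activations reflects the abstract's point that the pre-computation is reused rather than wasted.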

Authors:Yawen Zheng, Hanjia Lyu, Jiebo Luo
Title: Irony in Emojis: A Comparative Study of Human and LLM Interpretation
Abstract:
Emojis have become a universal language in online communication, often carrying nuanced and context-dependent meanings. Among these, irony poses a significant challenge for Large Language Models (LLMs) due to its inherent incongruity between appearance and intent. This study examines the ability of GPT-4o to interpret irony in emojis. By prompting GPT-4o to evaluate the likelihood of specific emojis being used to express irony on social media and comparing its interpretations with human perceptions, we aim to bridge the gap between machine and human understanding. Our findings reveal nuanced insights into GPT-4o's interpretive capabilities, highlighting areas of alignment with and divergence from human behavior. Additionally, this research underscores the importance of demographic factors, such as age and gender, in shaping emoji interpretation and evaluates how these factors influence GPT-4o's performance.
中文: 本研究通过对比GPT-4o与人类对表情符号反讽含义的解读,评估了其理解能力,揭示了双方认知的一致性与差异性,并强调了人口统计因素对解读的影响。
English: This study evaluates GPT-4o's ability to interpret irony in emojis by comparing its assessments with human perceptions, revealing both alignments and divergences while highlighting the influence of demographic factors on interpretation.

Authors:Dhruv Parikh, Jacob Fein-Ashley, Tian Ye, Rajgopal Kannan, Viktor Prasanna
Title: ClusterViG: Efficient Globally Aware Vision GNNs via Image Partitioning
Abstract:
Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have dominated the field of Computer Vision (CV). Graph Neural Networks (GNN) have performed remarkably well across diverse domains because they can represent complex relationships via unstructured graphs. However, the applicability of GNNs for visual tasks was unexplored till the introduction of Vision GNNs (ViG). Despite the success of ViGs, their performance is severely bottlenecked due to the expensive $k$-Nearest Neighbors ($k$-NN) based graph construction. Recent works addressing this bottleneck impose constraints on the flexibility of GNNs to build unstructured graphs, undermining their core advantage while introducing additional inefficiencies. To address these issues, in this paper, we propose a novel method called Dynamic Efficient Graph Convolution (DEGC) for designing efficient and globally aware ViGs. DEGC partitions the input image and constructs graphs in parallel for each partition, improving graph construction efficiency. Further, DEGC integrates local intra-graph and global inter-graph feature learning, enabling enhanced global context awareness. Using DEGC as a building block, we propose a novel CNN-GNN architecture, ClusterViG, for CV tasks. Extensive experiments indicate that ClusterViG reduces end-to-end inference latency for vision tasks by up to $5\times$ when compared against a suite of models such as ViG, ViHGNN, PVG, and GreedyViG, with a similar model parameter count. Additionally, ClusterViG reaches state-of-the-art performance on image classification, object detection, and instance segmentation tasks, demonstrating the effectiveness of the proposed globally aware learning strategy. Finally, input partitioning performed by DEGC enables ClusterViG to be trained efficiently on higher-resolution images, underscoring the scalability of our approach.
中文: 本文提出动态高效图卷积(DEGC)方法,通过并行图构建和局部-全局特征融合解决视觉图神经网络的计算瓶颈,并基于此开发出ClusterViG模型,在显著降低推理延迟的同时实现了多项视觉任务的顶尖性能。
English: This paper introduces Dynamic Efficient Graph Convolution (DEGC) to overcome the computational bottleneck in Vision GNNs by enabling efficient parallel graph construction and integrating local and global feature learning, leading to the ClusterViG model that achieves state-of-the-art performance with significantly reduced inference latency.

Authors:Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li
Title: PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
Abstract:
Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods, surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to 5.84%.
中文摘要:PromptGuard提出了一种新颖的内容审核技术,通过优化安全软提示来有效防止文本到图像模型生成不良内容,同时保持高质量良性输出和推理效率。
English Summary: PromptGuard introduces an innovative content moderation technique using optimized safety soft prompts to effectively prevent text-to-image models from generating NSFW content while maintaining high-quality benign outputs and inference efficiency.

Authors:Yuanhong Chen, Kazuki Shimada, Christian Simon, Yukara Ikemiya, Takashi Shibuya, Yuki Mitsufuji
Title: CCStereo: Audio-Visual Contextual and Contrastive Learning for Binaural Audio Generation
Abstract:
Binaural audio generation (BAG) aims to convert monaural audio to stereo audio using visual prompts, requiring a deep understanding of spatial and semantic information. However, current models risk overfitting to room environments and lose fine-grained spatial details. In this paper, we propose a new audio-visual binaural generation model incorporating an audio-visual conditional normalisation layer that dynamically aligns the mean and variance of the target difference audio features using visual context, along with a new contrastive learning method to enhance spatial sensitivity by mining negative samples from shuffled visual features. We also introduce a cost-efficient way to utilise test-time augmentation in video data to enhance performance. Our approach achieves state-of-the-art generation accuracy on the FAIR-Play and MUSIC-Stereo benchmarks.
Chinese: 本文提出了一种新颖的视听双耳音频生成模型,通过动态特征对齐和对比学习解决环境过拟合问题并保留空间细节,在基准数据集上取得了最优性能。
English: This paper introduces a novel audio-visual binaural generation model that employs dynamic feature alignment and contrastive learning to overcome overfitting and preserve spatial details, achieving state-of-the-art results on benchmark datasets.

Authors:Xiwen Chen, Peijie Qiu, Wenhui Zhu, Huayu Li, Hao Wang, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Title: Sequence Complementor: Complementing Transformers For Time Series Forecasting with Learnable Sequences
Abstract:
Since its introduction, the transformer has shifted the development trajectory away from traditional models (e.g., RNN, MLP) in time series forecasting, which is attributed to its ability to capture global dependencies within temporal tokens. Follow-up studies have largely involved altering the tokenization and self-attention modules to better adapt Transformers for addressing special challenges like non-stationarity, channel-wise dependency, and variable correlation in time series. However, after investigating several representative methods, we found that the expressive capability of sequence representation is a key factor influencing Transformer performance in time series forecasting: there is an almost linear relationship between sequence representation entropy and mean square error, with more diverse representations performing better. In this paper, we propose a novel attention mechanism with Sequence Complementors and prove it feasible from an information theory perspective, where these learnable sequences are able to provide complementary information beyond current input to feed attention. We further enhance the Sequence Complementors via a diversification loss that is theoretically covered. The empirical evaluation of both long-term and short-term forecasting has confirmed its superiority over the recent state-of-the-art methods.
中文: 该研究发现序列表示的多样性是影响Transformer在时间序列预测中性能的关键因素,并提出了带有序列补充器的新型注意力机制,通过理论支持的多样化损失进行增强,实证评估在长短期预测中均显示出优于现有先进方法的性能。
English: The study identifies sequence representation diversity as crucial for Transformer performance in time series forecasting and introduces a novel attention mechanism with Sequence Complementors, enhanced by a diversification loss, which demonstrates superior forecasting accuracy in empirical evaluations.

Authors:Xingwu Sun, Shuaipeng Li, Ruobing Xie, Weidong Han, Kan Wu, Zhen Yang, Yixing Li, An Wang, Shuai Li, Jinbao Xue, Yu Cheng, Yangyu Tao, Zhanhui Kang, Chengzhong Xu, Di Wang, Jie Jiang
Title: Scaling Laws for Floating Point Quantization Training
Abstract:
Low-precision training is considered an effective strategy for reducing both training and downstream inference costs. Previous scaling laws for precision mainly focus on integer quantization, which pays less attention to the constituents in floating-point (FP) quantization, and thus cannot well fit the LLM losses in this scenario. In contrast, while FP quantization training is more commonly implemented in production, its research has been relatively superficial. In this paper, we thoroughly explore the effects of FP quantization targets, exponent bits, mantissa bits, and the calculation granularity of the scaling factor on the FP quantization training performance of LLMs. In addition to an accurate FP quantization unified scaling law, we also provide valuable suggestions for the community: (1) Exponent bits contribute slightly more to the model performance than mantissa bits. We provide the optimal exponent-mantissa bit ratio for different bit numbers, which is available for future reference by hardware manufacturers; (2) We discover the formation of the critical data size in low-precision LLM training. Too much training data exceeding the critical data size will inversely bring in degradation of LLM performance; (3) The optimal FP quantization precision is directly proportional to the computational power, but within a wide computational power range, we estimate that the best cost-performance precision should lie between 4 and 8 bits.
中文摘要:本研究深入探索了LLM浮点量化训练,发现指数位比尾数位对性能影响略大,确定了超出临界数据量会导致性能下降的关键现象,并指出4-8比特为最佳性价比精度范围。
English Summary: This study thoroughly investigates floating-point quantization for LLMs, revealing that exponent bits slightly outweigh mantissa bits in importance, identifying a critical data size beyond which performance degrades, and determining the optimal precision range to be 4-8 bits for cost-effective training.

Authors:Ziqi Liang, Xulong Zhang, Chang Liu, Xiaoyang Qu, Weifeng Zhao, Jianzong Wang
Title: CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation
Abstract:
Voice Conversion (VC) aims to convert the style of a source speaker, such as timbre and pitch, to the style of any target speaker while preserving the linguistic content. However, the ground truth of the converted speech does not exist in a non-parallel VC scenario, which induces the train-inference mismatch problem. Moreover, existing methods still suffer from inaccurate pitch and low speaker adaptation quality, and there is a significant disparity in pitch between the source and target speaker style domains. As a result, the models tend to generate speech with hoarseness, posing challenges in achieving high-quality voice conversion. In this study, we propose CycleFlow, a novel VC approach that leverages cycle consistency in conditional flow matching (CFM) for speaker timbre adaptation training on non-parallel data. Furthermore, we design a Dual-CFM based on VoiceCFM and PitchCFM to generate speech and improve speaker pitch adaptation quality. Experiments show that our method can significantly improve speaker similarity, generating natural and higher-quality speech.
中文: 本研究提出CycleFlow,一种基于循环一致性条件流匹配的新型语音转换方法,通过双流设计在非平行数据上提升音色和音高适应能力,显著增强了语音的自然度与质量。
English: This study introduces CycleFlow, a novel voice conversion method using cycle-consistent conditional flow matching to enhance speaker timbre and pitch adaptation on non-parallel data, significantly improving speech naturalness and quality.

Authors:Peiliang Gong, Mohamed Ragab, Min Wu, Zhenghua Chen, Yongyi Su, Xiaoli Li, Daoqiang Zhang
Title: Augmented Contrastive Clustering with Uncertainty-Aware Prototyping for Time Series Test Time Adaptation
Abstract:
Test-time adaptation aims to adapt pre-trained deep neural networks using solely online unlabelled test data during inference. Although TTA has shown promise in visual applications, its potential in time series contexts remains largely unexplored. Existing TTA methods, originally designed for visual tasks, may not effectively handle the complex temporal dynamics of real-world time series data, resulting in suboptimal adaptation performance. To address this gap, we propose Augmented Contrastive Clustering with Uncertainty-aware Prototyping (ACCUP), a straightforward yet effective TTA method for time series data. Initially, our approach employs augmentation ensemble on the time series data to capture diverse temporal information and variations, incorporating uncertainty-aware prototypes to distill essential characteristics. Additionally, we introduce an entropy comparison scheme to selectively acquire more confident predictions, enhancing the reliability of pseudo labels. Furthermore, we utilize augmented contrastive clustering to enhance feature discriminability and mitigate error accumulation from noisy pseudo labels, promoting cohesive clustering within the same class while facilitating clear separation between different classes. Extensive experiments conducted on three real-world time series datasets and an additional visual dataset demonstrate the effectiveness and generalization potential of the proposed method, advancing the underexplored realm of TTA for time series data.
中文: 针对时间序列数据在测试时自适应领域研究不足的问题,我们提出ACCUP方法,通过增强对比聚类和不确定性感知原型来提升自适应性能与特征区分度,并在多个数据集上验证了其有效性。
English: Test-time adaptation for time series data remains underexplored, so we propose ACCUP, a method using augmented contrastive clustering with uncertainty-aware prototypes to enhance adaptation performance and feature discriminability, validated across multiple datasets.
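
The entropy comparison idea mentioned in the abstract (prefer the more confident of two candidate predictions when forming pseudo labels) can be illustrated with a short sketch. The abstract does not specify which two predictions are compared or how ties are handled, so the setup below is an assumption: two probability vectors for the same sample are compared by Shannon entropy, and the names `entropy` and `more_confident` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def more_confident(pred_a, pred_b):
    """Return whichever prediction has the lower entropy.

    Assumption: ties are resolved in favour of pred_a.
    """
    return pred_a if entropy(pred_a) <= entropy(pred_b) else pred_b


# Hypothetical class probabilities for one test sample from two sources.
sharp = [0.9, 0.05, 0.05]
flat = [0.4, 0.3, 0.3]
selected = more_confident(sharp, flat)
pseudo_label = max(range(len(selected)), key=selected.__getitem__)
```

Lower entropy means probability mass is concentrated on fewer classes, which is why it serves as a simple confidence proxy in this sketch.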

Authors:Yejing Wang, Chi Zhang, Xiangyu Zhao, Qidong Liu, Maolin Wang, Xuetao Wei, Zitao Liu, Xing Shi, Xudong Yang, Ling Zhong, Wei Lin
Title: Behavior Modeling Space Reconstruction for E-Commerce Search
Abstract:
Delivering superior search services is crucial for enhancing customer experience and driving revenue growth. Conventionally, search systems model user behaviors by combining user preference and query item relevance statically, often through a fixed logical 'and' relationship. This paper reexamines existing approaches through a unified lens using both causal graphs and Venn diagrams, uncovering two prevalent yet significant issues: entangled preference and relevance effects, and a collapsed modeling space. To surmount these challenges, our research introduces a novel framework, DRP, which enhances search accuracy through two components to reconstruct the behavior modeling space. Specifically, we implement preference editing to proactively remove the relevance effect from preference predictions, yielding untainted user preferences. Additionally, we employ adaptive fusion, which dynamically adjusts fusion criteria to align with the varying patterns of relevance and preference, facilitating more nuanced and tailored behavior predictions within the reconstructed modeling space. Empirical validation on two public datasets and a proprietary search dataset underscores the superiority of our proposed methodology, demonstrating marked improvements in performance over existing approaches.
Chinese: 本文提出DRP框架,通过偏好编辑和自适应融合解决搜索系统中偏好与相关性效应纠缠的问题,从而提升搜索准确性和性能表现。
English: This paper introduces the DRP framework to address issues of entangled preference and relevance effects in search systems by using preference editing and adaptive fusion to enhance accuracy and performance.

Authors:Kaiyu Li, Xiangyong Cao, Yupeng Deng, Chao Pang, Zepeng Xin, Deyu Meng, Zhi Wang
Title: DynamicEarth: How Far are We from Open-Vocabulary Change Detection?
Abstract:
Monitoring Earth's evolving land covers requires methods capable of detecting changes across a wide range of categories and contexts. Existing change detection methods are hindered by their dependency on predefined classes, reducing their effectiveness in open-world applications. To address this issue, we introduce open-vocabulary change detection (OVCD), a novel task that bridges vision and language to detect changes across any category. Considering the lack of high-quality data and annotation, we propose two training-free frameworks, M-C-I and I-M-C, which leverage and integrate off-the-shelf foundation models for the OVCD task. The insight behind the M-C-I framework is to discover all potential changes and then classify these changes, while the insight behind the I-M-C framework is to identify all targets of interest and then determine whether their states have changed. Based on these two frameworks, we instantiate several methods, e.g., SAM-DINOv2-SegEarth-OV, Grounding-DINO-SAM2-DINO, etc. Extensive evaluations on 5 benchmark datasets demonstrate the superior generalization and robustness of our OVCD methods over existing supervised and unsupervised methods. To support continued exploration, we release DynamicEarth, a dedicated codebase designed to advance research and application of OVCD. https://likyoo.github.io/DynamicEarth
中文: 为解决传统变化检测方法依赖预定义类别的问题,本研究提出开放词汇变化检测(OVCD)任务,并开发了无需训练的M-C-I和I-M-C框架,通过整合基础模型在多个基准测试中展现出卓越的泛化能力和鲁棒性。
English: To overcome the limitations of predefined classes in change detection, this study introduces open-vocabulary change detection (OVCD) and proposes two training-free frameworks, M-C-I and I-M-C, which integrate foundation models to achieve superior generalization and robustness across diverse benchmarks.
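The difference between the two frameworks is the order of the steps, which a short sketch can make concrete. The code below is only an illustration of that ordering as the abstract describes it: the callables `propose_masks`, `is_changed`, `classify`, and `find_targets` are hypothetical placeholders for off-the-shelf foundation models, not names from the DynamicEarth codebase.

```python
def m_c_i(img_t1, img_t2, propose_masks, is_changed, classify, targets):
    """Discover all potential changes first, then classify them."""
    # Step 1: class-agnostic region proposals over the image pair (hypothetical model).
    candidates = propose_masks(img_t1, img_t2)
    # Step 2: keep only the regions whose appearance changed between the two dates.
    changed = [m for m in candidates if is_changed(img_t1, img_t2, m)]
    # Step 3: keep the changed regions whose predicted category is of interest.
    return [m for m in changed if classify(img_t1, img_t2, m) in targets]


def i_m_c(img_t1, img_t2, find_targets, is_changed, targets):
    """Identify all targets of interest first, then test whether they changed."""
    # Step 1: open-vocabulary search for the target categories in each image.
    candidates = find_targets(img_t1, targets) + find_targets(img_t2, targets)
    # Step 2: keep only the targets whose state differs between the two dates.
    return [m for m in candidates if is_changed(img_t1, img_t2, m)]
```

In this sketch both functions return the same kind of result (a list of changed regions belonging to the requested categories); they differ only in whether category filtering happens after or before the change test.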

Authors:Samantha Min Er Yew, Xiaofeng Lei, Jocelyn Hui Lin Goh, Yibing Chen, Sahana Srinivasan, Miao-li Chee, Krithi Pushpanathan, Ke Zou, Qingshan Hou, Zhi Da Soh, Cancan Xue, Marco Chak Yan Yu, Charumathi Sabanayagam, E Shyong Tai, Xueling Sim, Yaxing Wang, Jost B. Jonas, Vinay Nangia, Gabriel Dawei Yang, Emma Anran Ran, Carol Yim-Lui Cheung, Yangqin Feng, Jun Zhou, Rick Siow Mong Goh, Yukun Zhou, Pearse A. Keane, Yong Liu, Ching-Yu Cheng, Yih-Chung Tham
Title: Are Traditional Deep Learning Model Approaches as Effective as a Retinal-Specific Foundation Model for Ocular and Systemic Disease Detection?
Abstract:
Background: RETFound, a self-supervised, retina-specific foundation model (FM), showed potential in downstream applications. However, its comparative performance with traditional deep learning (DL) models remains incompletely understood. This study aimed to evaluate RETFound against three ImageNet-pretrained supervised DL models (ResNet50, ViT-base, SwinV2) in detecting ocular and systemic diseases. Methods: We fine-tuned/trained RETFound and three DL models on full datasets, 50%, 20%, and fixed sample sizes (400, 200, 100 images, with half comprising disease cases; for each DR severity class, 100 and 50 cases were used). Fine-tuned models were tested internally using the SEED (53,090 images) and APTOS-2019 (3,672 images) datasets and externally validated on population-based (BES, CIEMS, SP2, UKBB) and open-source datasets (ODIR-5k, PAPILA, GAMMA, IDRiD, MESSIDOR-2). Model performance was compared using area under the receiver operating characteristic curve (AUC) and Z-tests with Bonferroni correction (P<0.05/3). Interpretation: Traditional DL models are mostly comparable to RETFound for ocular disease detection with large datasets. However, RETFound is superior in systemic disease detection with smaller datasets. These findings offer valuable insights into the respective merits and limitations of traditional models and FMs.
中文:RETFound在大型数据集上检测眼部疾病时与传统深度学习模型表现相当,但在小型数据集上检测全身性疾病时更优,这揭示了两种方法各自的优势与局限。
English: RETFound demonstrates comparable performance to traditional deep learning models for ocular disease detection with large datasets but shows superiority in systemic disease detection when using smaller datasets, highlighting the strengths and limitations of both approaches.
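The threshold P<0.05/3 in the abstract is a Bonferroni correction for three comparisons (RETFound against each of the three DL models). A minimal sketch of such a comparison follows, under the assumption that the two AUC estimates are independent and that their standard errors are known; the abstract does not say how the Z statistic was computed, so the function names and the example numbers are illustrative only.

```python
import math


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons


def two_sided_z_test(auc_a, se_a, auc_b, se_b):
    """Z statistic and two-sided p-value for the difference of two AUCs.

    Assumes the two estimates are independent (an assumption of this sketch).
    """
    z = (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p-value of a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def is_significant(p_value, alpha=0.05, n_comparisons=3):
    """True if the p-value clears the Bonferroni-corrected threshold."""
    return p_value < bonferroni_alpha(alpha, n_comparisons)
```

With three comparisons the per-test threshold is about 0.0167, so a difference with p close to 0.05 that would pass an uncorrected test is not declared significant here.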

Authors:Hangyu Liu, Bo Peng, Can Cui, Pengxiang Ding, Donglin Wang
Title: Enhancing Adversarial Transferability via Component-Wise Transformation
Abstract:
Deep Neural Networks (DNNs) are highly vulnerable to adversarial examples, which pose significant challenges in security-sensitive applications. Among various adversarial attack strategies, input transformation-based attacks have demonstrated remarkable effectiveness in enhancing adversarial transferability. However, existing methods still perform poorly across different architectures, even though they have achieved promising results within the same architecture. This limitation arises because, while models of the same architecture may focus on different regions of the object, the variation is even more pronounced across different architectures. Unfortunately, current approaches fail to effectively guide models to attend to these diverse regions. To address this issue, this paper proposes a novel input transformation-based attack method, termed Component-Wise Transformation (CWT). CWT applies interpolation and selective rotation to individual image blocks, ensuring that each transformed image highlights different target regions, thereby improving the transferability of adversarial examples. Extensive experiments on the standard ImageNet dataset show that CWT consistently outperforms state-of-the-art methods in both attack success rates and stability across CNN- and Transformer-based models.
中文: 本文提出组件级变换(CWT)这一新颖的基于输入变换的攻击方法,通过对图像块分别进行插值和选择性旋转来增强对抗样本的迁移性,在不同模型架构上均能稳定超越现有最优方法。
English: This paper introduces Component-Wise Transformation (CWT), a novel input transformation-based attack method that enhances adversarial transferability by applying interpolation and selective rotation to individual image blocks, consistently outperforming state-of-the-art methods across different model architectures.

Authors:Haotian Xu, Xing Wu, Weinong Wang, Zhongzhi Li, Da Zheng, Boyuan Chen, Yi Hu, Shijia Kang, Jiaming Ji, Yingying Zhang, Zhijiang Guo, Yaodong Yang, Muhan Zhang, Debing Zhang
Title: RedStar: Does Scaling Long-CoT Data Unlock Better Slow-Reasoning Systems?
Abstract:
Can scaling transform reasoning? In this work, we explore the untapped potential of scaling Long Chain-of-Thought (Long-CoT) data to 1000k samples, pioneering the development of a slow-thinking model, RedStar. Through extensive experiments with various LLMs and different sizes, we uncover the ingredients for specialization and scale for Long-CoT training. Surprisingly, even smaller models show significant performance gains with limited data, revealing the sample efficiency of Long-CoT and the critical role of sample difficulty in the learning process. Our findings demonstrate that Long-CoT reasoning can be effectively triggered with just a few thousand examples, while larger models achieve unparalleled improvements. We also introduce reinforcement learning (RL)-scale training as a promising direction for advancing slow-thinking systems. RedStar shines across domains: on the MATH-Hard benchmark, RedStar-code-math boosts performance from 66.2\% to 81.6\%, and on the USA Math Olympiad (AIME), it solves 46.7\% of problems using only 21k mixed-code-math datasets. In multimodal tasks like GeoQA and MathVista-GEO, RedStar-Geo achieves competitive results with minimal Long-CoT data, outperforming other slow-thinking systems like QvQ-Preview. Compared to QwQ, RedStar strikes the perfect balance between reasoning and generalizability. Our work highlights that, with careful tuning, scaling Long-CoT can unlock extraordinary reasoning capabilities, even with limited data, and set a new standard for slow-thinking models across diverse challenges. Our data and models are released at https://huggingface.co/RedStar-Reasoning.
中文: 通过将长思维链数据扩展至1000k样本,RedStar慢思考模型在数学和多模态任务中展现出卓越推理能力,即使数据有限,也能利用样本效率和难度实现性能突破。
English: Scaling Long Chain-of-Thought data to 1000k samples enables the development of RedStar, a slow-thinking model that significantly enhances reasoning across math and multimodal tasks, even with limited data, by leveraging sample efficiency and difficulty.

Authors:Cheng Liu, Hui Wang, Jinghua Zhao, Shiwan Zhao, Hui Bu, Xin Xu, Jiaming Zhou, Haoqin Sun, Yong Qin
Title: MusicEval: A Generative Music Dataset with Expert Ratings for Automatic Text-to-Music Evaluation
Abstract:
The technology for generating music from textual descriptions has seen rapid advancements. However, evaluating text-to-music (TTM) systems remains a significant challenge, primarily due to the difficulty of balancing performance and cost with existing objective and subjective evaluation methods. In this paper, we propose an automatic assessment task for TTM models to align with human perception. To address the TTM evaluation challenges posed by the professional requirements of music evaluation and the complexity of the relationship between text and music, we collect MusicEval, the first generative music assessment dataset. This dataset contains 2,748 music clips generated by 31 advanced and widely used models in response to 384 text prompts, along with 13,740 ratings from 14 music experts. Furthermore, we design a CLAP-based assessment model built on this dataset, and our experimental results validate the feasibility of the proposed task, providing a valuable reference for future development in TTM evaluation. The dataset is available at https://www.aishelltech.com/AISHELL_7A.
中文: 本文提出了首个用于自动评估文本生成音乐模型的MusicEval数据集,包含专家评分的音乐片段和基于CLAP的评估模型,使评估结果更符合人类感知。
English: This paper introduces MusicEval, the first dataset for automatically evaluating text-to-music generation models, featuring expert-rated music clips and a CLAP-based assessment model to align evaluations with human perception.
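The dataset sizes quoted in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the four counts given above; reading the result as "five ratings per clip" is an inference from those counts, since the abstract does not state how ratings are distributed across clips or experts.

```python
# Counts quoted in the abstract.
clips = 2748
prompts = 384
models = 31
ratings = 13740

# Average number of expert ratings attached to each clip.
ratings_per_clip = ratings / clips

# Average number of generated clips per text prompt and per model.
clips_per_prompt = clips / prompts
clips_per_model = clips / models

print(ratings_per_clip, clips_per_prompt, clips_per_model)
```

The ratio of ratings to clips is exactly 5, which is consistent with each clip being scored five times, while clips are not spread evenly over prompts or models.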

Authors:Yanfan Zhu, Issac Lyngaas, Murali Gopalakrishnan Meena, Mary Ellen I. Koran, Bradley Malin, Daniel Moyer, Shunxing Bao, Anuj Kapadia, Xiao Wang, Bennett Landman, Yuankai Huo
Title: Scale-up Unlearnable Examples Learning with High-Performance Computing
Abstract:
Recent advancements in AI models are structured to retain user interactions, which could inadvertently include sensitive healthcare data. In the healthcare field, particularly when radiologists use AI-driven diagnostic tools hosted on online platforms, there is a risk that medical imaging data may be repurposed for future AI training without explicit consent, spotlighting critical privacy and intellectual property concerns around healthcare data usage. Addressing these privacy challenges, a novel approach known as Unlearnable Examples (UEs) has been introduced, aiming to make data unlearnable to deep learning models. A prominent method within this area, called Unlearnable Clustering (UC), has shown improved UE performance with larger batch sizes but was previously limited by computational resources. To push the boundaries of UE performance with theoretically unlimited resources, we scaled up UC learning across various datasets using Distributed Data Parallel (DDP) training on the Summit supercomputer. Our goal was to examine UE efficacy at high-performance computing (HPC) levels to prevent unauthorized learning and enhance data security, particularly exploring the impact of batch size on UE's unlearnability. Utilizing the robust computational capabilities of the Summit, extensive experiments were conducted on diverse datasets such as Pets, MedMNist, Flowers, and Flowers102. Our findings reveal that both overly large and overly small batch sizes can lead to performance instability and affect accuracy. However, the relationship between batch size and unlearnability varied across datasets, highlighting the necessity for tailored batch size strategies to achieve optimal data protection. Our results underscore the critical role of selecting appropriate batch sizes based on the specific characteristics of each dataset to prevent learning and ensure data security in deep learning applications.
中文摘要:近期人工智能发展可能泄露医疗敏感数据,为此采用不可学习样本技术,并通过超级计算机的分布式并行训练优化批次大小,针对不同数据集实现最佳数据保护效果。
English Summary: Recent AI advancements risk exposing sensitive healthcare data, prompting the use of Unlearnable Examples enhanced by supercomputer-powered Distributed Data Parallel training to optimize batch sizes for robust data protection across diverse datasets.

Authors:Ilias Diakonikolas, Daniel M. Kane, Sihan Liu, Thanasis Pittas
Title: Entangled Mean Estimation in High-Dimensions
Abstract:
We study the task of high-dimensional entangled mean estimation in the subset-of-signals model. Specifically, given $N$ independent random points $x_1,\ldots,x_N$ in $\mathbb{R}^D$ and a parameter $\alpha\in (0, 1)$ such that each $x_i$ is drawn from a Gaussian with mean $\mu$ and unknown covariance, and an unknown $\alpha$-fraction of the points have identity-bounded covariances, the goal is to estimate the common mean $\mu$. The one-dimensional version of this task has received significant attention in theoretical computer science and statistics over the past decades. Recent work [LY20; CV24] has given near-optimal upper and lower bounds for the one-dimensional setting. On the other hand, our understanding of even the information-theoretic aspects of the multivariate setting has remained limited. In this work, we design a computationally efficient algorithm achieving an information-theoretically near-optimal error. Specifically, we show that the optimal error (up to polylogarithmic factors) is $f(\alpha,N) + \sqrt{D/(\alpha N)}$, where the term $f(\alpha,N)$ is the error of the one-dimensional problem and the second term is the sub-Gaussian error rate. Our algorithmic approach employs an iterative refinement strategy, whereby we progressively learn more accurate approximations $\hat{\mu}$ to $\mu$. This is achieved via a novel rejection sampling procedure that removes points significantly deviating from $\hat{\mu}$, as an attempt to filter out unusually noisy samples. A complication that arises is that rejection sampling introduces bias in the distribution of the remaining points. To address this issue, we perform a careful analysis of the bias, develop an iterative dimension-reduction strategy, and employ a novel subroutine inspired by list-decodable learning that leverages the one-dimensional result.
中文: 本文提出了一种高效算法,通过迭代优化和偏差校正的拒绝采样方法,在高维纠缠均值估计中实现了接近理论最优的误差率。
English: This paper presents a computationally efficient algorithm for high-dimensional entangled mean estimation that achieves near-optimal error rates through iterative refinement and bias-corrected rejection sampling.
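The iterative refinement idea in the abstract (estimate the mean, discard points far from the estimate, re-estimate) can be illustrated with a much simpler stand-in. The sketch below is not the paper's algorithm: it uses a hard distance threshold instead of rejection sampling, performs no bias correction or dimension reduction, and the radius schedule is an arbitrary assumption. It also includes the sub-Gaussian term of the error bound for reference.

```python
import math
import statistics


def subgaussian_term(dim, alpha, n):
    """Second term of the error rate in the abstract: sqrt(D / (alpha * N))."""
    return math.sqrt(dim / (alpha * n))


def filtered_mean(points, radii):
    """Simplified iterative refinement of a mean estimate.

    points: list of equal-length lists of floats.
    radii:  decreasing distance thresholds, one per refinement round
            (a hypothetical schedule chosen by the caller).
    """
    dim = len(points[0])
    # Robust starting point: the coordinate-wise median.
    mu_hat = [statistics.median(p[k] for p in points) for k in range(dim)]
    for radius in radii:
        # Keep only the points within the current radius of the estimate.
        kept = [p for p in points if math.dist(p, mu_hat) <= radius]
        if not kept:
            break  # threshold too aggressive; keep the previous estimate
        # Re-estimate the mean from the surviving points.
        mu_hat = [sum(p[k] for p in kept) / len(kept) for k in range(dim)]
    return mu_hat
```

Calling `filtered_mean(points, radii=[20.0, 10.0, 5.0])` on a mixture of low-variance and high-variance Gaussian samples with a shared mean progressively removes the noisiest samples. The hard threshold truncates the surviving distribution, which is the source of the bias the abstract says must be analysed.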

Authors:Hantao Lou, Jiaming Ji, Kaile Wang, Yaodong Yang
Title: Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction
Abstract:
The rapid advancement of large language models (LLMs) has led to significant improvements in their capabilities, but also to increased concerns about their alignment with human values and intentions. Current alignment strategies, including adaptive training and inference-time methods, have demonstrated potential in this area. However, these approaches still struggle to balance deployment complexity and capability across various tasks and difficulties. In this work, we introduce the Streaming Distribution Induce Aligner (Stream Aligner), a novel alignment paradigm that combines efficiency with enhanced performance in various tasks throughout the generation process. Stream Aligner achieves dynamic sentence-level correction by using a small model to learn the preferences of the suffix sentence, iteratively correcting the suffix sentence output by the upstream model, and then using the corrected sentence to replace the suffix sentence in subsequent generations. Compared to Aligner, our experiments demonstrate that Stream Aligner reduces reliance on the capabilities of additional models, enhances the reasoning abilities of LLMs, and decreases latency during user interaction. Specifically, Stream Aligner-2B model has achieved an improvement of 76.1% in helpfulness, 36.0% in harmlessness on the tested Llama2-70B-chat model, and Stream Aligner-8B has achieved an improvement of 3.5% on the math ability of the tested Llama3-70B-Instruct model.
中文:提出的Stream Aligner是一种高效的校准范式,能在生成过程中动态修正句子输出,提升大语言模型的推理能力、安全性和实用性,同时降低延迟并减少对外部模型的依赖。
English: The proposed Stream Aligner is an efficient alignment paradigm that dynamically corrects sentence outputs during generation, enhancing LLMs' reasoning, safety, and helpfulness while reducing latency and reliance on external models.
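The sentence-level correction loop described in the abstract can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation: `generate_sentence` stands in for the upstream model, `correct_sentence` for the small aligner model, and both signatures, the `None` stop signal, and `max_sentences` are hypothetical.

```python
def stream_align(prompt, generate_sentence, correct_sentence, max_sentences=8):
    """Generate an answer sentence by sentence, correcting each one in turn.

    generate_sentence(prompt, prefix) -> next draft sentence, or None when done.
    correct_sentence(prompt, prefix, draft) -> corrected version of the draft.
    """
    prefix = []  # corrected sentences accepted so far
    for _ in range(max_sentences):
        # The upstream model continues from the corrected prefix.
        draft = generate_sentence(prompt, prefix)
        if draft is None:
            break
        # The small model rewrites only the newest (suffix) sentence.
        corrected = correct_sentence(prompt, prefix, draft)
        # The corrected sentence replaces the draft for all later generation steps.
        prefix.append(corrected)
    return prefix
```

Because each correction feeds back into the prefix, later sentences are generated conditioned on already-corrected text, and the small model only ever has to handle one sentence at a time.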

Authors:Yewei Song, Xunzhu Tang, Cedric Lothritz, Saad Ezzini, Jacques Klein, Tegawendé F. Bissyandé, Andrey Boytsov, Ulrick Ble, Anne Goujon
Title: CallNavi, A Challenge and Empirical Study on LLM Function Calling and Routing
Abstract:
API-driven chatbot systems are increasingly integral to software engineering applications, yet their effectiveness hinges on accurately generating and executing API calls. This is particularly challenging in scenarios requiring multi-step interactions with complex parameterization and nested API dependencies. Addressing these challenges, this work contributes to the evaluation and assessment of AI-based software development through three key advancements: (1) the introduction of a novel dataset specifically designed for benchmarking API function selection, parameter generation, and nested API execution; (2) an empirical evaluation of state-of-the-art language models, analyzing their performance across varying task complexities in API function generation and parameter accuracy; and (3) a hybrid approach to API routing, combining general-purpose large language models for API selection with fine-tuned models and prompt engineering for parameter generation. These innovations significantly improve API execution in chatbot systems, offering practical methodologies for enhancing software design, testing, and operational workflows in real-world software engineering contexts.
中文: 本研究提出了用于评估API函数选择和参数生成的专用数据集,测试了语言模型在复杂API任务中的表现,并开发出结合通用大模型与微调模型的混合路由方法,显著提升了聊天机器人系统在软件工程应用中的API执行效能。
English: This research introduces a dataset for benchmarking API function selection and parameter generation, evaluates language models' performance on complex API tasks, and proposes a hybrid routing method that significantly enhances chatbot systems' API execution capabilities for software engineering applications.

Authors:Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Lu Qi, Xiangtai Li
Title: Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Abstract:
Recent advancements in multimodal large language models (MLLMs) have shown strong abilities in visual perception, reasoning, and vision-language understanding. However, the visual matching ability of MLLMs is rarely studied, even though finding the visual correspondence of objects is essential in computer vision. Our research reveals that the matching capabilities of recent MLLMs still exhibit systematic shortcomings, even in strong current models such as GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of the MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, comprising 220K visual matching samples with reasoning annotations. To our knowledge, this is the first visual correspondence dataset and benchmark for the MLLM community. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: a fine-grained vision expert with object-level contrastive learning and an instruction augmentation strategy. The former learns instance-discriminative tokens, while the latter further improves instruction-following ability. CoLVA-InternVL2-4B achieves an overall accuracy (OA) of 49.80\% on the MMVM benchmark, surpassing GPT-4o and the best open-source MLLM, Qwen2VL-72B, by 7.15\% and 11.72\% OA, respectively. These results demonstrate the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models will be released.
中文摘要:尽管多模态大语言模型在视觉感知方面表现出色,但其视觉匹配能力存在系统性不足,为此我们构建了MMVM基准并开发了CoLVA模型,该模型在基准测试中以49.80%的准确率超越GPT-4o达7.15%。
English Summary: Recent multimodal large language models excel in visual perception but lack robust visual matching capabilities, prompting the creation of the MMVM benchmark and CoLVA model, which outperforms leading models like GPT-4o by over 7% in accuracy.

Authors:Danni Peng, Yuan Wang, Huazhu Fu, Jinpeng Jiang, Yong Liu, Rick Siow Mong Goh, Qingsong Wei
Title: Look Back for More: Harnessing Historical Sequential Updates for Personalized Federated Adapter Tuning
Abstract:
Personalized federated learning (PFL) studies effective model personalization to address the data heterogeneity issue among clients in traditional federated learning (FL). Existing PFL approaches mainly generate personalized models by relying solely on the clients' latest updated models while ignoring their previous updates, which may result in suboptimal personalized model learning. To bridge this gap, we propose a novel framework termed pFedSeq, designed for personalizing adapters to fine-tune a foundation model in FL. In pFedSeq, the server maintains and trains a sequential learner, which processes a sequence of past adapter updates from clients and generates calibrations for personalized adapters. To effectively capture the cross-client and cross-step relations hidden in previous updates and generate high-performing personalized adapters, pFedSeq adopts the powerful selective state space model (SSM) as the architecture of the sequential learner. Through extensive experiments on four public benchmark datasets, we demonstrate the superiority of pFedSeq over state-of-the-art PFL methods.
中文: 个性化联邦学习(PFL)通过个性化模型解决数据异质性问题,提出的pFedSeq框架采用选择性状态空间模型作为序列学习器,利用客户端历史更新生成个性化适配器,在实验中优于现有方法。
English: Personalized federated learning (PFL) addresses data heterogeneity by personalizing models, and the proposed pFedSeq framework uses a sequential learner with a selective state space model to generate personalized adapters by leveraging past client updates, outperforming existing methods in experiments.

Authors:Hadi Askari, Shivanshu Gupta, Terry Tong, Fei Wang, Anshuman Chhabra, Muhao Chen
Title: Unraveling Indirect In-Context Learning Using Influence Functions
Abstract:
In this work, we introduce a novel paradigm for generalized In-Context Learning (ICL), termed Indirect In-Context Learning. In Indirect ICL, we explore demonstration selection strategies tailored for two distinct real-world scenarios: Mixture of Tasks and Noisy ICL. We systematically evaluate the effectiveness of Influence Functions (IFs) as a selection tool for these settings, highlighting the potential of IFs to better capture the informativeness of examples within the demonstration pool. For the Mixture of Tasks setting, demonstrations are drawn from 28 diverse tasks, including MMLU, BigBench, StrategyQA, and CommonsenseQA. We demonstrate that combining BertScore-Recall (BSR) with an IF surrogate model can further improve performance, leading to average absolute accuracy gains of 0.37\% and 1.45\% for 3-shot and 5-shot setups when compared to traditional ICL metrics. In the Noisy ICL setting, we examine scenarios where demonstrations might be mislabeled or have adversarial noise. Our experiments show that reweighting traditional ICL selectors (BSR and Cosine Similarity) with IF-based selectors boosts accuracy by an average of 2.90\% for Cosine Similarity and 2.94\% for BSR on noisy GLUE benchmarks. For the adversarial sub-setting, we show the utility of using IFs for task-agnostic demonstration selection for backdoor attack mitigation, with a 32.89\% reduction in Attack Success Rate compared to task-aware methods. In sum, we propose a robust framework for demonstration selection that generalizes beyond traditional ICL, offering valuable insights into the role of IFs for Indirect ICL.
中文: 本研究提出了间接情境学习新范式,通过运用影响函数优化混合任务和噪声场景中的示例选择策略,在提升模型准确率的同时显著增强了对抗攻击的防御能力。
English: This study introduces Indirect In-Context Learning, a novel framework that employs Influence Functions to enhance demonstration selection in mixed-task and noisy scenarios, achieving significant accuracy improvements and robustness against adversarial attacks.

Authors:Simon Kohaut, Benedict Flade, Daniel Ochs, Devendra Singh Dhami, Julian Eggert, Kristian Kersting
Title: Probabilistic Mission Design in Neuro-Symbolic Systems
Abstract:
Advanced Air Mobility (AAM) is a growing field that demands accurate modeling of legal concepts and restrictions in navigating intelligent vehicles. In addition, any implementation of AAM needs to face the challenges posed by inherently dynamic and uncertain human-inhabited spaces robustly. Nevertheless, the employment of Unmanned Aircraft Systems (UAS) beyond visual line of sight (BVLOS) is an endearing task that promises to enhance significantly today's logistics and emergency response capabilities. To tackle these challenges, we present a probabilistic and neuro-symbolic architecture to encode legal frameworks and expert knowledge over uncertain spatial relations and noisy perception in an interpretable and adaptable fashion. More specifically, we demonstrate Probabilistic Mission Design (ProMis), a system architecture that links geospatial and sensory data with declarative, Hybrid Probabilistic Logic Programs (HPLP) to reason over the agent's state space and its legality. As a result, ProMis generates Probabilistic Mission Landscapes (PML), which quantify the agent's belief that a set of mission conditions is satisfied across its navigation space. Extending prior work on ProMis' reasoning capabilities and computational characteristics, we show its integration with potent machine learning models such as Large Language Models (LLM) and Transformer-based vision models. Hence, our experiments underpin the application of ProMis with multi-modal input data and how our method applies to many important AAM scenarios.
Chinese: 先进空中交通需要能够稳健处理法律框架和动态环境的系统,而概率任务设计(ProMis)架构通过整合概率逻辑和机器学习,实现了可解释的推理和任务规划,以应对这些挑战。
English: Advanced Air Mobility requires robust systems to handle legal frameworks and dynamic environments, which is addressed by the Probabilistic Mission Design (ProMis) architecture that integrates probabilistic logic and machine learning for interpretable reasoning and mission planning.

Authors:Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
Title: 2 OLMo 2 Furious
Abstract:
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to compute, often matching or outperforming open-weight only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with or surpassing open-weight only models of comparable size, including Qwen 2.5, Llama 3.1 and Gemma 2. We release all OLMo 2 artifacts openly -- models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.
中文: OLMo 2是新一代全开源语言模型系列,提供7B和13B版本,其完全透明的训练数据和代码使其在减少计算资源的同时,性能上仍可与Llama 3.1和Qwen 2.5等模型相媲美。
English: OLMo 2 is a fully open-source language model family released at 7B and 13B scales, featuring complete transparency in training data and code while matching or outperforming open-weight models like Llama 3.1 and Qwen 2.5 with fewer FLOPs.

Authors:Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Jake Poznanski, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi
Title: 2 OLMo 2 Furious
Abstract:
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts -- model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.
中文: OLMo 2是新一代全开源语言模型系列,提供7B、13B和32B版本,其完全透明的训练数据和代码使其在减少计算资源的同时,性能上仍可与Llama 3.1和GPT-3.5 Turbo等模型相媲美。
English: OLMo 2 is a fully open-source language model family available in 7B, 13B, and 32B versions, featuring complete transparency in training data and code while achieving competitive performance with fewer computational resources compared to models like Llama 3.1 and GPT-3.5 Turbo.

Authors:Ilias Diakonikolas, Daniel M. Kane, Mingchen Ma
Title: Active Learning of General Halfspaces: Label Queries vs Membership Queries
Abstract:
We study the problem of learning general (i.e., not necessarily homogeneous) halfspaces under the Gaussian distribution on $\mathbb{R}^d$ in the presence of some form of query access. In the classical pool-based active learning model, where the algorithm is allowed to make adaptive label queries to previously sampled points, we establish a strong information-theoretic lower bound ruling out non-trivial improvements over the passive setting. Specifically, we show that any active learner requires label complexity of $\tilde{\Omega}(d/(\log(m)\epsilon))$, where $m$ is the number of unlabeled examples. In particular, to beat the passive label complexity of $\tilde{O}(d/\epsilon)$, an active learner requires a pool of $2^{\mathrm{poly}(d)}$ unlabeled samples. On the positive side, we show that this lower bound can be circumvented with membership query access, even in the agnostic model. Specifically, we give a computationally efficient learner with query complexity of $\tilde{O}(\min\{1/p, 1/\epsilon\} + d\cdot \mathrm{polylog}(1/\epsilon))$ achieving error guarantee of $O(\mathrm{opt})+\epsilon$. Here $p \in [0, 1/2]$ is the bias and $\mathrm{opt}$ is the 0-1 loss of the optimal halfspace. As a corollary, we obtain a strong separation between the active and membership query models. Taken together, our results characterize the complexity of learning general halfspaces under Gaussian marginals in these models.
中文: 本文证明在高斯分布下,基于池查询的主动学习对一般半空间的标记效率无法显著超越被动学习,但成员查询可在不可知模型中实现高效学习并获得强误差保证。
English: This paper demonstrates that active learning with pool-based queries cannot significantly improve label efficiency over passive learning for general halfspaces under Gaussian distributions, but shows membership queries enable efficient agnostic learning with strong error guarantees.
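
A rough way to read the pool-size claim in this abstract, as a back-of-the-envelope sketch rather than the paper's proof: ignore the logarithmic factors hidden in the tilde notation, write the lower bound with an assumed constant $c > 0$, and suppose the active learner wants to beat the passive rate $d/\epsilon$ by a polynomial factor $d^{\alpha}$ for some assumed $\alpha > 0$. With $q$ denoting the number of label queries and $\log$ taken base 2:

$$ \frac{c\,d}{\log(m)\,\epsilon} \le q \le \frac{d^{1-\alpha}}{\epsilon} \;\Longrightarrow\; \log(m) \ge c\,d^{\alpha} \;\Longrightarrow\; m \ge 2^{c\,d^{\alpha}} $$

Under these assumptions the pool must contain $2^{\mathrm{poly}(d)}$ unlabeled samples, which matches the statement in the abstract. The symbols $q$, $c$, and $\alpha$ are introduced here for illustration only.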

Authors:Muhammad Uzair Zahid, Serkan Kiranyaz, Alper Yildirim, Moncef Gabbouj
Title: CoRe-Net: Co-Operational Regressor Network with Progressive Transfer Learning for Blind Radar Signal Restoration
Abstract:
Real-world radar signals are frequently corrupted by various artifacts, including sensor noise, echoes, interference, and intentional jamming, differing in type, severity, and duration. This pilot study introduces a novel model, the Co-Operational Regressor Network (CoRe-Net), for blind radar signal restoration, designed to address such limitations and drawbacks. CoRe-Net replaces adversarial training with a novel cooperative learning strategy, leveraging the complementary roles of its Apprentice Regressor (AR) and Master Regressor (MR). The AR restores radar signals corrupted by various artifacts, while the MR evaluates the quality of the restoration and provides immediate and task-specific feedback, ensuring stable and efficient learning. The AR, therefore, has the advantage of both self-learning and assistive learning by the MR. The proposed model has been extensively evaluated over the benchmark Blind Radar Signal Restoration (BRSR) dataset, which simulates diverse real-world artifact scenarios. Under a fair experimental setup, this study shows that CoRe-Net surpasses Op-GANs by more than 1 dB in mean SNR improvement. To further boost the performance gain, this study proposes multi-pass restoration by cascaded CoRe-Nets trained with a novel paradigm called Progressive Transfer Learning (PTL), which enables iterative refinement, thus achieving an additional 2 dB mean SNR enhancement. Multi-pass CoRe-Net training by PTL consistently yields incremental performance improvements through successive restoration passes whilst highlighting CoRe-Net's ability to handle such a complex and varying blend of artifacts.
中文: 本试点研究提出CoRe-Net模型,通过学徒与主回归器的协作学习实现盲雷达信号恢复,在基准测试中取得超过1分贝信噪比提升,并采用渐进式迁移学习实现多级恢复,进一步获得2分贝性能增益。
English: This pilot study introduces CoRe-Net, a novel model using cooperative learning between apprentice and master regressors for blind radar signal restoration, achieving over 1 dB SNR improvement and further enhanced performance through multi-pass restoration with progressive transfer learning.

Authors:Cong-Duy Nguyen, Xiaobao Wu, Thong Nguyen, Shuai Zhao, Khoi Le, Viet-Anh Nguyen, Feng Yichao, Anh Tuan Luu
Title: Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation
Abstract:
Previous research on multimodal entity linking (MEL) has primarily employed contrastive learning as the main objective. However, by using the rest of the batch as negative samples without careful consideration, these studies risk leveraging easy features and potentially overlooking essential details that make entities unique. In this work, we propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach designed to enhance the matching ability of multimodal entity linking models. JD-CCL leverages meta-information to select negative samples with similar attributes, making the linking task more challenging and robust. Additionally, to address the limitations caused by the variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform). It enhances visual representations by incorporating multi-view synthetic images and contextual textual representations to scale and shift patch representations. Experimental results on benchmark MEL datasets demonstrate the strong effectiveness of our approach.
中文: 本文提出JD-CCL方法,通过元信息选择具有相似属性的负样本来增强多模态实体链接的鲁棒性,并引入CVaCPT技术,利用合成图像和上下文文本来优化视觉表示以应对视觉差异。
English: This paper introduces JD-CCL, a method that improves multimodal entity linking by selecting challenging negative samples using meta-information, and CVaCPT, which enhances visual representations with synthetic images and contextual text to address visual variations.
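
The negative-sample selection idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact selection rule, so the code below assumes that each entity's meta-information is a set of attribute strings and that the hardest negatives are simply the entities with the smallest Jaccard distance to the target. The function names and the toy attribute table are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """Return 1 - |intersection| / |union|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def select_hard_negatives(target: str, attributes: dict, k: int) -> list:
    """Return the k entities whose attribute sets are closest to the target's.

    Assumption: closeness in Jaccard distance over meta-information is used as
    the criterion for a hard negative. Ties are broken by entity name.
    """
    anchor = attributes[target]
    candidates = [entity for entity in attributes if entity != target]
    candidates.sort(key=lambda e: (jaccard_distance(anchor, attributes[e]), e))
    return candidates[:k]


# Hypothetical toy meta-information, for illustration only.
attributes = {
    "Michael Jordan (basketball)": {"human", "male", "american", "athlete"},
    "Michael Jordan (scientist)": {"human", "male", "american", "professor"},
    "Jordan (country)": {"country", "asia"},
    "Air Jordan": {"brand", "footwear"},
}
negatives = select_hard_negatives("Michael Jordan (basketball)", attributes, k=2)
```

In this toy table the scientist shares three of five attributes with the basketball player, so it is ranked as the closest (hardest) negative, while entities with no shared attributes sit at distance 1.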

Authors:Xinhao Deng, Xiyuan Zhao, Qilei Yin, Zhuotao Liu, Qi Li, Mingwei Xu, Ke Xu, Jianping Wu
Title: Towards Robust Multi-tab Website Fingerprinting
Abstract:
Website fingerprinting enables an eavesdropper to determine which websites a user is visiting over an encrypted connection. State-of-the-art website fingerprinting (WF) attacks have demonstrated effectiveness even against Tor-protected network traffic. However, existing WF attacks have critical limitations in accurately identifying websites in multi-tab browsing sessions, where the holistic pattern of individual websites is no longer preserved, and the number of tabs opened by a client is unknown a priori. In this paper, we propose ARES, a novel WF framework natively designed for multi-tab WF attacks. ARES formulates the multi-tab attack as a multi-label classification problem and solves it using novel Transformer-based models. Specifically, ARES extracts local patterns based on multi-level traffic aggregation features and utilizes an improved self-attention mechanism to analyze the correlations between these local patterns, effectively identifying websites. We implement a prototype of ARES and extensively evaluate its effectiveness using our large-scale datasets collected over multiple months. The experimental results illustrate that ARES achieves optimal performance in several realistic scenarios. Further, ARES remains robust even against various WF defenses.
Chinese: ARES是一种新颖的网站指纹识别框架,通过将多标签页浏览建模为多标签分类问题,并利用基于Transformer的模型分析流量模式关联,从而有效识别多标签页会话中的访问网站。
English: ARES is a novel website fingerprinting framework that uses Transformer-based models to effectively identify websites in multi-tab browsing by treating it as a multi-label classification problem and analyzing traffic pattern correlations.
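
To make the multi-label formulation concrete, here is a minimal sketch of the decision step only, assuming some model has already produced one score (logit) per monitored website. Each site gets an independent sigmoid probability, so the number of predicted sites is not fixed in advance, which is what an unknown number of tabs requires. The function name, the threshold, and the example scores are hypothetical and not taken from ARES.

```python
import math


def predict_open_sites(logits: dict, threshold: float = 0.5) -> list:
    """Return every site whose independent sigmoid probability meets the threshold."""
    probabilities = {site: 1.0 / (1.0 + math.exp(-z)) for site, z in logits.items()}
    return sorted(site for site, p in probabilities.items() if p >= threshold)


# Hypothetical per-site scores for one traffic trace.
example_logits = {"a.example": 2.0, "b.example": -1.0, "c.example": 0.3}
predicted = predict_open_sites(example_logits)
```

A single-label (softmax) classifier would be forced to pick exactly one site, whereas this per-label thresholding can return zero, one, or several sites for the same trace.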

Authors:Yiyang Wang, Xi Chen, Xiaogang Xu, Sihui Ji, Yu Liu, Yujun Shen, Hengshuang Zhao
Title: DiffDoctor: Diagnosing Image Diffusion Models Before Treating
Abstract:
In spite of recent progress, image diffusion models still produce artifacts. A common solution is to leverage the feedback provided by quality assessment systems or human annotators to optimize the model, where images are generally rated in their entirety. In this work, we believe problem-solving starts with identification, yielding the request that the model should be aware of not just the presence of defects in an image, but their specific locations. Motivated by this, we propose DiffDoctor, a two-stage pipeline to assist image diffusion models in generating fewer artifacts. Concretely, the first stage targets developing a robust artifact detector, for which we collect a dataset of over 1M flawed synthesized images and set up an efficient human-in-the-loop annotation process, incorporating a carefully designed class-balance strategy. The learned artifact detector is then involved in the second stage to optimize the diffusion model by providing pixel-level feedback. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness of our artifact detector as well as the soundness of our diagnose-then-treat design.
中文: 本文提出DiffDoctor,采用两阶段方法先通过大规模缺陷图像数据集和人工辅助标注开发鲁棒的伪影检测器,再通过像素级反馈优化扩散模型,有效减少生成图像中的伪影。
English: This paper introduces DiffDoctor, a two-stage pipeline that first develops a robust artifact detector using a large dataset of flawed images and human-in-the-loop annotations, then optimizes diffusion models by providing pixel-level feedback to reduce artifacts in generated images.

Authors:Jiahao Huang, Jianhang Zhu, Rongpeng Li, Zhifeng Zhao, Honggang Zhang
Title: Select2Drive: Pragmatic Communications for Real-Time Collaborative Autonomous Driving
Abstract:
Vehicle-to-everything communications-assisted autonomous driving has witnessed remarkable advancements in recent years, with pragmatic communications (PragComm) emerging as a promising paradigm for real-time collaboration among vehicles and other agents. Simultaneously, extensive research has explored the interplay between collaborative perception and decision-making in end-to-end driving frameworks. In this work, we revisit the collaborative driving problem and propose the Select2Drive framework to optimize the utilization of limited computational and communication resources. Particularly, to mitigate cumulative latency in perception and decision-making, Select2Drive introduces distributed predictive perception by formulating an active prediction paradigm and simplifying high-dimensional semantic feature prediction into a computation cost-efficient, motion-aware reconstruction. Given the "less is more" principle that an over-broadened perceptual horizon possibly confuses the decision module rather than contributing to it, Select2Drive utilizes area-of-importance-based PragComm to prioritize the communications of critical regions, thus boosting both communication efficiency and decision-making efficacy. Empirical evaluations on the V2Xverse and real-world DAIR-V2X demonstrate that Select2Drive achieves a 2.60% and 1.99% improvement in offline perception tasks under limited bandwidth (resp., pose error conditions). Moreover, it delivers at most 8.35% and 2.65% enhancement in closed-loop driving scores and route completion rates, particularly in scenarios characterized by dense traffic and high-speed dynamics.
中文: Select2Drive框架通过分布式预测感知和基于重要区域的通信优先级优化,在有限资源下提升了自动驾驶的感知性能和驾驶效率,尤其在密集交通和高速场景中表现卓越。
English: The Select2Drive framework enhances autonomous driving by optimizing computational and communication resources through distributed predictive perception and prioritized communication of critical areas, significantly improving performance in perception tasks and driving metrics under constrained conditions.

Authors:Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
Title: Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
Abstract:
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
中文: Ouroboros-Diffusion是一种新型视频生成框架,通过潜在采样和跨帧注意力机制增强长视频的时序一致性和主体连贯性,在生成一致性长视频方面优于现有方法。
English: Ouroboros-Diffusion is a novel video generation framework that enhances long-range temporal and subject consistency through latent sampling and cross-frame attention mechanisms, outperforming existing methods in generating coherent long videos.
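
The queue mechanics that this abstract attributes to FIFO video diffusion can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FIFO-Diffusion or Ouroboros-Diffusion implementation: frames are short lists of floats, and `toy_denoise_step` is a hypothetical placeholder that stands in for one denoising step of a pre-trained text-to-video model.

```python
import random


def toy_denoise_step(frame, level):
    """Hypothetical stand-in for one denoising step of a pre-trained model.

    It moves a frame from noise level `level` to `level - 1` by shrinking it,
    so a frame at level 1 becomes fully "clean" (all zeros in this toy).
    """
    return [x * (level - 1) / level for x in frame]


def fifo_generate(num_frames, queue_len=4, frame_size=3, seed=0):
    rng = random.Random(seed)

    def fresh_noise():
        return [rng.gauss(0.0, 1.0) for _ in range(frame_size)]

    # Position i holds a frame at noise level i + 1:
    # the head is almost clean, the tail is the noisiest.
    queue = [fresh_noise() for _ in range(queue_len)]
    outputs = []
    for _ in range(num_frames):
        # One denoising step for every frame, each at its own noise level.
        queue = [toy_denoise_step(f, i + 1) for i, f in enumerate(queue)]
        outputs.append(queue.pop(0))  # the head is now clean: dequeue it
        queue.append(fresh_noise())   # enqueue Gaussian noise at the tail
    return outputs


frames = fifo_generate(num_frames=5)
```

Because every iteration emits one frame and admits one new noisy frame, the loop can run for an arbitrary number of frames. Nothing in this toy couples the frames to one another, which is the gap in cross-frame consistency that the abstract says Ouroboros-Diffusion targets.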

Authors:Yongshuo Zhu, Lu Li, Keyan Chen, Chenyang Liu, Fugen Zhou, Zhenwei Shi
Title: Semantic-CD: Remote Sensing Image Semantic Change Detection towards Open-vocabulary Setting
Abstract:
Remote sensing image semantic change detection is a method used to analyze remote sensing images, aiming to identify areas of change as well as categorize these changes within images of the same location taken at different times. Traditional change detection methods often face challenges in generalizing across semantic categories in practical scenarios. To address this issue, we introduce a novel approach called Semantic-CD, specifically designed for semantic change detection in remote sensing images. This method incorporates the open vocabulary semantics from the vision-language foundation model, CLIP. By utilizing CLIP's extensive vocabulary knowledge, our model enhances its ability to generalize across categories and improves segmentation through fully decoupled multi-task learning, which includes both binary change detection and semantic change detection tasks. Semantic-CD consists of four main components: a bi-temporal CLIP visual encoder for extracting features from bi-temporal images, an open semantic prompter for creating semantic cost volume maps with open vocabulary, a binary change detection decoder for generating binary change detection masks, and a semantic change detection decoder for producing semantic labels. Experimental results on the SECOND dataset demonstrate that Semantic-CD achieves more accurate masks and reduces semantic classification errors, illustrating its effectiveness in applying semantic priors from vision-language foundation models to SCD tasks.
Chinese: Semantic-CD提出了一种新颖的遥感图像语义变化检测方法,利用CLIP的开放词汇语义和完全解耦的多任务学习来提升类别泛化能力和分割精度,在SECOND数据集上验证了其有效性。
English: Semantic-CD introduces a novel remote sensing image change detection method that leverages CLIP's open vocabulary and multi-task learning to enhance generalization and segmentation accuracy, achieving improved performance on the SECOND dataset.

Authors:Wenxin Luo, Weirui Wang, Xiaopeng Li, Weibo Zhou, Pengyue Jia, Xiangyu Zhao
Title: TAPO: Task-Referenced Adaptation for Prompt Optimization
Abstract:
Prompt engineering can significantly improve the performance of large language models (LLMs), with automated prompt optimization (APO) gaining significant attention due to the time-consuming and laborious nature of manual prompt design. However, much of the existing work in APO overlooks task-specific characteristics, resulting in prompts that lack domain specificity and are not well-suited for task-specific optimization. In this paper, we introduce TAPO, a multitask-aware prompt optimization framework composed of three key modules. First, a task-aware metric selection module is proposed to enhance task-specific prompt generation capabilities. Second, we present a multi-metrics evaluation module to jointly evaluate prompts from multiple perspectives. Third, an evolution-based optimization framework is introduced for automatic prompt refinement, which improves adaptability across various tasks. Extensive experiments on six datasets demonstrate the effectiveness of our approach, and our code is publicly available.
中文摘要:TAPO是一个多任务感知的提示优化框架,通过任务感知指标选择、多维度评估和进化式优化,有效提升大语言模型在特定任务中的性能,并在六个数据集上验证了其优越性。
English Summary: TAPO is a multitask-aware prompt optimization framework that enhances LLM performance through task-specific metric selection, multi-perspective evaluation, and evolution-based prompt refinement, demonstrating effectiveness across six datasets.

Authors:Chen Huang, Yang Deng, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua, Jimmy Xiangji Huang
Title: How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond
Abstract:
With the advancement of large language models (LLMs), intelligent models have evolved from mere tools to autonomous agents with their own goals and strategies for cooperating with humans. This evolution has birthed a novel paradigm in NLP, i.e., human-model cooperation, that has yielded remarkable progress in numerous NLP tasks in recent years. In this paper, we take the first step to present a thorough review of human-model cooperation, exploring its principles, formalizations, and open challenges. In particular, we introduce a new taxonomy that provides a unified perspective to summarize existing approaches. Also, we discuss potential frontier areas and their corresponding challenges. We regard our work as an entry point, paving the way for more breakthrough research in this regard.
中文摘要:随着大语言模型发展为具有自主目标的智能体,人机协作已成为自然语言处理领域的新范式,本文通过统一分类法首次系统综述该领域并展望未来研究方向。
English Summary: The evolution of large language models into autonomous agents has established human-model cooperation as a transformative paradigm in NLP, which this paper systematically reviews through a unified taxonomy while identifying future research directions.

Authors:Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Title: Demystifying Domain-adaptive Post-training for Financial LLMs
Abstract:
Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs.
中文: 本研究提出了FINDAP框架,针对金融领域的大语言模型进行领域自适应后训练,通过定义核心能力、优化训练方法、构建数据集和评估体系,开发出领先的Llama-Fin模型,并揭示了各训练阶段对能力形成的作用及有效解决方案。
English: The study introduces FINDAP, a comprehensive framework for domain-adaptive post-training of LLMs in finance, featuring specialized components for capability definition, training optimization, dataset curation, and evaluation, which produces the state-of-the-art Llama-Fin model and reveals key insights into effective adaptation strategies.

Authors:Kang Chen, Yajing Zheng, Tiejun Huang, Zhaofei Yu
Title: Rethinking High-speed Image Reconstruction Framework with Spike Camera
Abstract:
Spike cameras, as innovative neuromorphic devices, generate continuous spike streams to capture high-speed scenes with lower bandwidth and higher dynamic range than traditional RGB cameras. However, reconstructing high-quality images from the spike input under low-light conditions remains challenging. Conventional learning-based methods often rely on the synthetic dataset as the supervision for training. Still, these approaches falter when dealing with noisy spikes fired under the low-light environment, leading to further performance degradation in the real-world dataset. This phenomenon is primarily due to inadequate noise modelling and the domain gap between synthetic and real datasets, resulting in recovered images with unclear textures, excessive noise, and diminished brightness. To address these challenges, we introduce a novel spike-to-image reconstruction framework SpikeCLIP that goes beyond traditional training paradigms. Leveraging the CLIP model's powerful capability to align text and images, we incorporate the textual description of the captured scene and unpaired high-quality datasets as the supervision. Our experiments on real-world low-light datasets U-CALTECH and U-CIFAR demonstrate that SpikeCLIP significantly enhances texture details and the luminance balance of recovered images. Furthermore, the reconstructed images are well-aligned with the broader visual features needed for downstream tasks, ensuring more robust and versatile performance in challenging environments.
Chinese: SpikeCLIP提出了一种创新的脉冲到图像重建框架,利用CLIP的文本-图像对齐能力和非配对高质量数据集,显著提升了低光环境下图像的纹理细节与亮度平衡,有效克服了传统方法的局限性。
English: SpikeCLIP introduces a novel spike-to-image reconstruction framework that leverages CLIP's text-image alignment and unpaired high-quality datasets to significantly enhance texture details and luminance balance in low-light conditions, overcoming the limitations of conventional methods.

Authors:Tianqi Ren, Rongpeng Li, Ming-min Zhao, Xianfu Chen, Guangyi Liu, Yang Yang, Zhifeng Zhao, Honggang Zhang
Title: Separate Source Channel Coding Is Still What You Need: An LLM-based Rethinking
Abstract:
Along with the proliferating research interest in Semantic Communication (SemCom), Joint Source Channel Coding (JSCC) has dominated the attention because it is widely assumed to deliver information semantics efficiently. Nevertheless, this paper challenges the conventional JSCC paradigm and advocates for the adoption of Separate Source Channel Coding (SSCC) to exploit its greater degrees of freedom for optimization. We demonstrate that SSCC, by leveraging the strengths of a Large Language Model (LLM) for source coding, complemented by an Error Correction Code Transformer (ECCT) for channel decoding, offers superior performance over JSCC. Our proposed framework also effectively highlights the compatibility challenges between SemCom approaches and digital communication systems, particularly concerning the resource costs associated with the transmission of high-precision floating-point numbers. Through comprehensive evaluations, we establish that, empowered by LLM-based compression and ECCT-enhanced error correction, SSCC remains a viable and effective solution for modern communication systems. In other words, separate source and channel coding is still what we need!
中文摘要:本文挑战传统的联合信源信道编码范式,提出采用分离式信源信道编码框架,通过大语言模型实现信源编码并结合纠错码变换器进行信道解码,不仅展现出更优性能,还解决了语义通信方法与数字通信系统的兼容性问题。
English Summary: This paper challenges the conventional Joint Source Channel Coding (JSCC) paradigm by proposing a Separate Source Channel Coding (SSCC) framework that leverages Large Language Models for source coding and Error Correction Code Transformers for channel decoding, demonstrating superior performance and compatibility with digital communication systems.
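
The separation principle discussed in this abstract can be shown with a toy pipeline in which the source coder and the channel coder are independent, swappable stages. This sketch deliberately substitutes much simpler components for the ones in the paper: `zlib` stands in for LLM-based compression, and a 3x repetition code with majority-vote decoding stands in for an error correction code decoded by ECCT. All function names are hypothetical.

```python
import zlib


def source_encode(text: str) -> bytes:
    """Source coding stage (zlib here, in place of LLM-based compression)."""
    return zlib.compress(text.encode("utf-8"))


def source_decode(data: bytes) -> str:
    return zlib.decompress(data).decode("utf-8")


def to_bits(data: bytes) -> list:
    """Unpack bytes into a list of bits, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def from_bits(bits: list) -> bytes:
    """Pack a list of bits (length divisible by 8) back into bytes."""
    return bytes(int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8))


def channel_encode(bits: list, r: int = 3) -> list:
    """Channel coding stage: repeat every bit r times."""
    return [b for b in bits for _ in range(r)]


def channel_decode(bits: list, r: int = 3) -> list:
    """Majority vote over each group of r received bits."""
    return [int(sum(bits[i:i + r]) * 2 > r) for i in range(0, len(bits), r)]


message = "separate source and channel coding"
sent = channel_encode(to_bits(source_encode(message)))
received = list(sent)
received[0] ^= 1   # channel error in the first codeword (bits 0-2)
received[10] ^= 1  # channel error in the fourth codeword (bits 9-11)
recovered = source_decode(from_bits(channel_decode(received)))
```

Each stage only sees bits, so either one can be replaced without touching the other. That modularity is the degree of freedom the abstract argues for; the toy says nothing about how the real LLM and ECCT components perform.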

Authors:Chaoran Feng, Wangbo Yu, Xinhua Cheng, Zhenyu Tang, Junwu Zhang, Li Yuan, Yonghong Tian
Title: AE-NeRF: Augmenting Event-Based Neural Radiance Fields for Non-ideal Conditions and Larger Scene
Abstract:
Compared to frame-based methods, computational neuromorphic imaging using event cameras offers significant advantages, such as minimal motion blur, enhanced temporal resolution, and high dynamic range. The multi-view consistency of Neural Radiance Fields (NeRF), combined with the unique benefits of event cameras, has spurred recent research into reconstructing NeRF from data captured by moving event cameras. While showing impressive performance, existing methods rely on ideal conditions with the availability of uniform and high-quality event sequences and accurate camera poses, and mainly focus on object-level reconstruction, thus limiting their practical applications. In this work, we propose AE-NeRF to address the challenges of learning event-based NeRF from non-ideal conditions, including non-uniform event sequences, noisy poses, and various scales of scenes. Our method exploits the density of event streams and jointly learns a pose correction module with an event-based NeRF (e-NeRF) framework for robust 3D reconstruction from inaccurate camera poses. To generalize to larger scenes, we propose hierarchical event distillation with a proposal e-NeRF network and a vanilla e-NeRF network to resample and refine the reconstruction process. We further propose an event reconstruction loss and a temporal loss to improve the view consistency of the reconstructed scene. We establish a comprehensive benchmark that includes large-scale scenes to simulate practical non-ideal conditions, incorporating both synthetic and challenging real-world event datasets. The experimental results show that our method achieves a new state-of-the-art in event-based 3D reconstruction.
Chinese: 本文提出AE-NeRF方法,通过结合位姿校正模块、分层事件蒸馏策略及专用损失函数,实现在非理想条件下从事件相机进行鲁棒三维重建,并在各类场景中达到最优性能。
English: This paper introduces AE-NeRF, a novel method that enables robust 3D reconstruction from event cameras under non-ideal conditions by integrating pose correction, hierarchical distillation, and specialized loss functions, achieving state-of-the-art performance across diverse scenes.

Authors:Yuanpeng Tu, Xi Chen, Ser-Nam Lim, Hengshuang Zhao
Title: DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data
Abstract:
Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributable mainly to trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline with off-the-shelf models. We propose crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data can significantly outperform manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
Chinese: DreamMask提出了一种以数据为中心的方法,通过生成合成训练数据并与真实数据对齐来增强开放词汇全景分割,显著提升了模型的泛化能力,并在ADE20K等基准测试中大幅超越先前方法。
English: DreamMask introduces a data-centric approach to enhance open-vocabulary panoptic segmentation by generating synthetic training data and aligning it with real data, significantly improving model generalization and outperforming prior methods on benchmarks like ADE20K.

Authors:Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, Hengshuang Zhao
Title: VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
Abstract:
Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motions at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance and meanwhile support fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a weighted loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
Chinese: VideoAnydoor 是一种零样本视频对象插入框架,通过像素扭曲器保留细节外观并利用关键点轨迹实现精细运动控制,无需特定任务微调即可显著优于现有方法。
English: VideoAnydoor is a zero-shot framework that enables high-fidelity object insertion into videos by preserving detailed appearances through a pixel warper and allowing precise motion control via key-point trajectories, outperforming existing methods without requiring task-specific adjustments.

Authors:Harshith Gowrachari, Giovanni Stabile, Gianluigi Rozza
Title: Model Reduction for Transport-Dominated Problems via Cross-Correlation Based Snapshot Registration
Abstract:
Traditional linear approximation methods, such as proper orthogonal decomposition and the reduced basis method, are ill-suited for transport-dominated problems due to the slow decay of the Kolmogorov $n$-width, leading to inefficient and inaccurate reduced-order models. In this work, we propose a model reduction approach for transport-dominated problems by employing cross-correlation based snapshot registration to accelerate the Kolmogorov $n$-width decay, thereby enabling the construction of efficient and accurate reduced-order models using linear approximation methods. We propose a complete framework comprising offline-online stages for the development of reduced-order models using cross-correlation based snapshot registration. The effectiveness of the proposed approach is demonstrated using two test cases: 1D travelling waves and the higher-order methods benchmark test case, 2D isentropic convective vortex.
Chinese: 本研究提出了一种基于互相关快照配准的模型降阶方法,通过加速Kolmogorov n-宽度的衰减,有效解决了输运主导问题,从而构建出高效且精确的线性降阶模型。
English: The study introduces a model reduction method using cross-correlation based snapshot registration to address transport-dominated problems, enabling efficient and accurate reduced-order models by accelerating the decay of the Kolmogorov n-width.
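The registration idea in this entry can be illustrated with a minimal sketch. It assumes a periodic 1D grid, integer grid shifts, and a single reference snapshot; the function names (`translate`, `estimate_shift`, `register`) are hypothetical and are not taken from the paper, whose actual framework may differ (for example in how shifts are interpolated or handled online).

```python
def translate(u, k):
    """Move a periodic 1D profile k grid points to the right."""
    n = len(u)
    return [u[(j - k) % n] for j in range(n)]

def circular_cross_correlation(ref, snap):
    """corr[lag] = sum_i ref[i] * snap[(i + lag) mod n]."""
    n = len(ref)
    return [sum(ref[i] * snap[(i + lag) % n] for i in range(n))
            for lag in range(n)]

def estimate_shift(ref, snap):
    """Lag at which the cross-correlation with the reference peaks."""
    corr = circular_cross_correlation(ref, snap)
    return max(range(len(corr)), key=corr.__getitem__)

def register(ref, snap):
    """Shift the snapshot back so it lines up with the reference."""
    shift = estimate_shift(ref, snap)
    n = len(snap)
    aligned = [snap[(i + shift) % n] for i in range(n)]
    return aligned, shift

# A toy travelling pulse sampled at three times (shifts of 0, 2 and 5 points).
reference = [0, 1, 3, 1, 0, 0, 0, 0]
snapshots = [translate(reference, k) for k in (0, 2, 5)]

results = [register(reference, s) for s in snapshots]
shifts = [shift for _, shift in results]
aligned = [a for a, _ in results]
print(shifts)
print(all(a == reference for a in aligned))
```

In this toy case every registered snapshot coincides with the reference, so the aligned snapshot set has rank one, whereas the unregistered travelling pulse needs several linear modes. That is the sense in which registration speeds up the $n$-width decay before a linear method such as proper orthogonal decomposition is applied.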

Authors:Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing
Title: VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Abstract:
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. Besides, the lack of high-quality object-level video instruction data and a comprehensive benchmark further hinders their advancements. To tackle these challenges, we introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding, i.e., enabling perception and reasoning on any objects throughout the video. Specifically, we thoroughly develop VideoRefer Suite across three essential aspects: dataset, model, and benchmark. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality object-level video instruction dataset, termed VideoRefer-700K. Next, we present the VideoRefer model, which equips a versatile spatial-temporal object encoder to capture precise regional and sequential representations. Finally, we meticulously create a VideoRefer-Bench to comprehensively assess the spatial-temporal understanding capability of a Video LLM, evaluating it across various aspects. Extensive experiments and analyses demonstrate that our VideoRefer model not only achieves promising performance on video referring benchmarks but also facilitates general video understanding capabilities.
Chinese: VideoRefer套件通过提供大规模数据集、配备对象编码器的专用模型以及全面基准测试,增强了视频大语言模型在细粒度时空视频理解方面的能力,在对象级和通用视频任务中均表现出色。
English: The VideoRefer Suite enhances Video LLMs for fine-grained spatial-temporal video understanding by providing a large-scale dataset, a specialized model with an object encoder, and a comprehensive benchmark, achieving superior performance in both object-level and general video tasks.

Authors:Alessio Russo, Alberto Maria Metelli, Marcello Restelli
Title: Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models
Abstract:
We tackle average-reward infinite-horizon POMDPs with an unknown transition model but a known observation model, a setting that has been previously addressed in two limiting ways: (i) frequentist methods relying on suboptimal stochastic policies having a minimum probability of choosing each action, and (ii) Bayesian approaches employing the optimal policy class but requiring strong assumptions about the consistency of employed estimators. Our work removes these limitations by proving convenient estimation guarantees for the transition model and introducing an optimistic algorithm that leverages the optimal class of deterministic belief-based policies. We introduce modifications to existing estimation techniques providing theoretical guarantees separately for each estimated action transition matrix. Unlike existing estimation methods that are unable to use samples from different policies, we present a novel and simple estimator that overcomes this barrier. This new data-efficient technique, combined with the proposed \emph{Action-wise OAS-UCRL} algorithm and a tighter theoretical analysis, leads to the first approach enjoying a regret guarantee of order $\mathcal{O}(\sqrt{T \,\log T})$ when compared against the optimal policy, thus improving over state of the art techniques. Finally, theoretical results are validated through numerical simulations showing the efficacy of our method against baseline methods.
Chinese: 本研究针对转移模型未知的POMDP问题,提出了新型估计器和行动优化算法,通过确定性信念策略实现了数据高效利用,获得了\(\mathcal{O}(\sqrt{T \,\log T})\)的近乎最优遗憾界,突破了现有方法的局限性。
English: This work introduces a novel estimator and the Action-wise OAS-UCRL algorithm for average-reward POMDPs with unknown transitions, achieving improved data efficiency and a near-optimal regret bound of \(\mathcal{O}(\sqrt{T \,\log T})\) while using deterministic belief-based policies.

Authors:Meng Luo, Han Zhang, Shengqiong Wu, Bobo Li, Hong Han, Hao Fei
Title: NUS-Emo at SemEval-2024 Task 3: Instruction-Tuning LLM for Multimodal Emotion-Cause Analysis in Conversations
Abstract:
This paper describes the architecture of our system developed for Task 3 of SemEval-2024: Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of subtask 2, dedicated to Multimodal Emotion-Cause Pair Extraction with Emotion Category (MECPE-Cat), and constructs a dual-component system tailored to the unique challenges of this task. We divide the task into two subtasks: emotion recognition in conversation (ERC) and emotion-cause pair extraction (ECPE). To address these subtasks, we capitalize on the abilities of Large Language Models (LLMs), which have consistently demonstrated state-of-the-art performance across various natural language processing tasks and domains. Most importantly, we design an approach of emotion-cause-aware instruction-tuning for LLMs, to enhance the perception of the emotions with their corresponding causal rationales. Our method enables us to adeptly navigate the complexities of MECPE-Cat, achieving a weighted average F1 score of 34.71% on the task and securing the 2nd rank on the leaderboard. The code and metadata to reproduce our experiments are all made publicly available.
Chinese: 本文提出了一种基于情感-因果感知指令调优大语言模型的双组件系统,用于多模态情感-因果对提取,在SemEval-2024任务3中以34.71%的F1分数获得第二名。
English: This paper presents a dual-component system using emotion-cause-aware instruction-tuned LLMs for multimodal emotion-cause pair extraction, achieving second place in SemEval-2024 Task 3 with a 34.71% F1 score.

Authors:Shujuan Huang, Chunyu Lin, Yao Zhao
Title: What Really Matters for Learning-based LiDAR-Camera Calibration
Abstract:
Calibration is an essential prerequisite for the accurate data fusion of LiDAR and camera sensors. Traditional calibration techniques often require specific targets or suitable scenes to obtain reliable 2D-3D correspondences. To tackle the challenge of target-less and online calibration, deep neural networks have been introduced to solve the problem in a data-driven manner. While previous learning-based methods have achieved impressive performance on specific datasets, they still struggle in complex real-world scenarios. Most existing works focus on improving calibration accuracy but overlook the underlying mechanisms. In this paper, we revisit the development of learning-based LiDAR-Camera calibration and encourage the community to pay more attention to the underlying principles to advance practical applications. We systematically analyze the paradigm of mainstream learning-based methods, and identify the critical limitations of regression-based methods with the widely used data generation pipeline. Our findings reveal that most learning-based methods inadvertently operate as retrieval networks, focusing more on single-modality distributions rather than cross-modality correspondences. We also investigate how the input data format and preprocessing operations impact network performance and summarize the regression clues to inform further improvements.
Chinese: 本文批判性地审视了基于学习的激光雷达-相机标定方法,揭示多数方法实为侧重单模态数据而非跨模态对应的检索网络,并呼吁深入研究底层机制以提升实际应用能力。
English: This paper critically reviews learning-based LiDAR-camera calibration methods, revealing that many function as retrieval networks prioritizing single-modality data over cross-modality correspondences, and calls for deeper investigation into underlying mechanisms to enhance real-world applicability.

Authors:Pengfei Zhu, Peng Shu, Mengshi Qi, Liang Liu, Huadong Ma
Title: Target-driven Self-Distillation for Partial Observed Trajectories Forecasting
Abstract:
Accurate prediction of future trajectories of traffic agents is essential for ensuring safe autonomous driving. However, partially observed trajectories can significantly degrade the performance of even state-of-the-art models. Previous approaches often rely on knowledge distillation to transfer features from fully observed trajectories to partially observed ones. This involves firstly training a fully observed model and then using a distillation process to create the final model. While effective, they require multi-stage training, making the training process very expensive. Moreover, knowledge distillation can lead to a performance degradation of the model. In this paper, we introduce a Target-driven Self-Distillation method (TSD) for motion forecasting. Our method leverages predicted accurate targets to guide the model in making predictions under partial observation conditions. By employing self-distillation, the model learns from the feature distributions of both fully observed and partially observed trajectories during a single end-to-end training process. This enhances the model's ability to predict motion accurately in both fully observed and partially observed scenarios. We evaluate our method on multiple datasets and state-of-the-art motion forecasting models. Extensive experimental results demonstrate that our approach achieves significant performance improvements in both settings. To facilitate further research, we will release our code and model checkpoints.
Chinese: 本文提出的目标驱动自蒸馏方法通过端到端训练,在完整和部分观测条件下均能实现精准的运动预测,在多个数据集上取得了显著性能提升。
English: This paper introduces a Target-driven Self-Distillation method that enables end-to-end training for accurate motion forecasting under both fully and partially observed conditions, achieving significant performance improvements across multiple datasets.

Authors:Zheng Liu, Chaofan Li, Shitao Xiao, Chaozhuo Li, Defu Lian, Yingxia Shao
Title: Matryoshka Re-Ranker: A Flexible Re-Ranking Architecture With Configurable Depth and Width
Abstract:
Large language models (LLMs) provide powerful foundations to perform fine-grained text re-ranking. However, they are often prohibitive in reality due to constraints on computation bandwidth. In this work, we propose a \textbf{flexible} architecture called \textbf{Matryoshka Re-Ranker}, which is designed to facilitate \textbf{runtime customization} of model layers and sequence lengths at each layer based on users' configurations. Consequently, the LLM-based re-rankers can be made applicable across various real-world situations. The increased flexibility may come at the cost of precision loss. To address this problem, we introduce a suite of techniques to optimize the performance. First, we propose \textbf{cascaded self-distillation}, where each sub-architecture learns to preserve a precise re-ranking performance from its super components, whose predictions can be exploited as smooth and informative teacher signals. Second, we design a \textbf{factorized compensation mechanism}, where two collaborative Low-Rank Adaptation modules, vertical and horizontal, are jointly employed to compensate for the precision loss resulting from arbitrary combinations of layer and sequence compression. We perform comprehensive experiments based on the passage and document retrieval datasets from MSMARCO, along with all public datasets from BEIR benchmark. In our experiments, Matryoshka Re-Ranker substantially outperforms the existing methods, while effectively preserving its superior performance across various forms of compression and different application scenarios.
Chinese: Matryoshka Re-Ranker 提出了一种灵活的架构,支持运行时自定义模型层和序列长度,通过级联自蒸馏和因子化补偿机制在解决计算限制的同时保持优异性能。
English: The Matryoshka Re-Ranker introduces a flexible architecture that allows runtime customization of model layers and sequence lengths, addressing computational constraints while maintaining performance through cascaded self-distillation and factorized compensation mechanisms.

Authors:Yu Qiao, Apurba Adhikary, Huy Q. Le, Eui-Nam Huh, Zhu Han, Choong Seon Hong
Title: Towards Communication-Efficient Adversarial Federated Learning for Robust Edge Intelligence
Abstract:
Federated learning (FL) has gained significant attention for enabling decentralized training on edge networks without exposing raw data. However, FL models remain susceptible to adversarial attacks and performance degradation in non-IID data settings, thus posing challenges to both robustness and accuracy. This paper aims to achieve communication-efficient adversarial federated learning (AFL) by leveraging a pre-trained model to enhance both robustness and accuracy under adversarial attacks and non-IID challenges in AFL. By leveraging the knowledge from a pre-trained model for both clean and adversarial images, we propose a pre-trained model-guided adversarial federated learning (PM-AFL) framework. This framework integrates vanilla and adversarial mixture knowledge distillation to effectively balance accuracy and robustness while promoting local models to learn from diverse data. Specifically, for clean accuracy, we adopt a dual distillation strategy where the class probabilities of randomly paired images and their blended versions are aligned between the teacher model and the local models. For adversarial robustness, we employ a similar distillation approach but replace clean samples on the local side with adversarial examples. Moreover, by considering the bias between local and global models, we also incorporate a consistency regularization term to ensure that local adversarial predictions stay aligned with their corresponding global clean ones. These strategies collectively enable local models to absorb diverse knowledge from the teacher model while maintaining close alignment with the global model, thereby mitigating overfitting to local optima and enhancing the generalization of the global model. Experiments demonstrate that the PM-AFL-based framework not only significantly outperforms other methods but also maintains communication efficiency.
Chinese: 本文提出了一种预训练模型引导的对抗性联邦学习(PM-AFL)框架,通过融合混合知识蒸馏和一致性正则化,在分布式训练中同时提升精度与鲁棒性,实现了卓越性能并保持通信效率。
English: This paper introduces a pre-trained model-guided adversarial federated learning (PM-AFL) framework that enhances both accuracy and robustness in decentralized training by integrating mixture knowledge distillation and consistency regularization, achieving superior performance and communication efficiency.

Authors:Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, Heung-Yeung Shum
Title: Taming Teacher Forcing for Masked Autoregressive Video Generation
Abstract:
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
Chinese: MAGI是一种结合掩码建模与因果建模的混合视频生成框架,其核心创新完全教师强制技术显著提升了帧生成质量,在视频预测和生成长序列方面表现卓越。
English: MAGI is a hybrid video generation framework that integrates masked and causal modeling, featuring Complete Teacher Forcing to enhance frame generation and achieve superior performance in video prediction and long-sequence coherence.

Authors:Yonghui Yang, Le Wu, Zhuangzhuang He, Zhengwei Wu, Richang Hong, Meng Wang
Title: Less is More: Information Bottleneck Denoised Multimedia Recommendation
Abstract:
Empowered by semantic-rich content information, multimedia recommendation has emerged as a potent personalized technique. Current endeavors center around harnessing multimedia content to refine item representation or uncovering latent item-item structures based on modality similarity. Despite the effectiveness, we posit that these methods are usually suboptimal due to the introduction of irrelevant multimedia features into recommendation tasks. This stems from the fact that generic multimedia feature extractors, while well-designed for domain-specific tasks, can inadvertently introduce task-irrelevant features, leading to potential misguidance of recommenders. In this work, we propose a denoised multimedia recommendation paradigm via the Information Bottleneck principle (IB). Specifically, we propose a novel Information Bottleneck denoised Multimedia Recommendation (IBMRec) model to tackle the irrelevant feature issue. IBMRec removes task-irrelevant features from both feature and item-item structure perspectives, which are implemented by two-level IB learning modules: feature-level (FIB) and graph-level (GIB). In particular, FIB focuses on learning the minimal yet sufficient multimedia features. This is achieved by maximizing the mutual information between multimedia representation and recommendation tasks, while concurrently minimizing it between multimedia representation and pre-trained multimedia features. Furthermore, GIB is designed to learn a robust item-item graph structure: it refines the item-item graph based on preference affinity, then minimizes the mutual information between the original graph and the refined one. Extensive experiments across three benchmarks validate the effectiveness of our proposed model, showcasing high performance and applicability to various multimedia recommenders.
Chinese: 本文提出IBMRec模型,通过信息瓶颈原理从特征和物品关联结构两个层面去除多媒体推荐中的无关特征,利用特征级和图级学习模块确保信息相关性,从而提升推荐性能。
English: This paper introduces IBMRec, a multimedia recommendation model that applies the Information Bottleneck principle to eliminate irrelevant features from both feature representations and item-item structures, enhancing recommendation accuracy by focusing on task-relevant information.

Authors:Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin
Title: Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Abstract:
This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of the expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a $\textbf{micro-batch}$ and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences. So, the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the token evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, it encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to $\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.
Chinese: 本文提出在MoE训练中使用全局批次而非微批次计算负载均衡损失,该方法增强了专家的领域专业化能力,并在预训练和下游任务中显著提升了模型性能。
English: This paper proposes using a global-batch strategy instead of micro-batches to calculate Load-balancing Loss in MoE training, which enhances expert specialization and improves model performance across pre-training and downstream tasks.
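The difference between the two ways of computing $N_E \sum_{i=1}^{N_E} f_i p_i$ can be made concrete with a small sketch. It is an illustration under stated assumptions, not the authors' implementation: it assumes top-1 routing, plain Python lists in place of tensors, and a simple in-process sum standing in for the extra communication step; all function names are hypothetical.

```python
def selection_counts(expert_ids, num_experts):
    """How many tokens were routed to each expert (top-1 routing assumed)."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts

def mean_gate_scores(gate_scores, num_experts):
    """p_i: average gating score of expert i over the tokens of one micro-batch."""
    n = len(gate_scores)
    return [sum(row[i] for row in gate_scores) / n for i in range(num_experts)]

def lbl(f, p):
    """Load-balancing loss N_E * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))

def micro_batch_lbl(micro_batches, num_experts):
    """f_i is computed inside each micro-batch; the losses are then averaged."""
    losses = []
    for expert_ids, gate_scores in micro_batches:
        counts = selection_counts(expert_ids, num_experts)
        f = [c / len(expert_ids) for c in counts]
        losses.append(lbl(f, mean_gate_scores(gate_scores, num_experts)))
    return sum(losses) / len(losses)

def global_batch_lbl(micro_batches, num_experts):
    """f_i is first synchronized across micro-batches, then reused in each loss."""
    total = [0] * num_experts
    n_tokens = 0
    for expert_ids, _ in micro_batches:
        for i, c in enumerate(selection_counts(expert_ids, num_experts)):
            total[i] += c
        n_tokens += len(expert_ids)
    f_global = [c / n_tokens for c in total]  # stands in for the sync step
    losses = [lbl(f_global, mean_gate_scores(g, num_experts))
              for _, g in micro_batches]
    return sum(losses) / len(losses)

# Two domain-specific micro-batches of four tokens each, two experts.
code_batch = ([0, 0, 0, 0], [[0.75, 0.25]] * 4)   # every token goes to expert 0
prose_batch = ([1, 1, 1, 1], [[0.25, 0.75]] * 4)  # every token goes to expert 1
batches = [code_batch, prose_batch]

print(micro_batch_lbl(batches, 2))   # 1.5
print(global_batch_lbl(batches, 2))  # 1.0
```

In this toy setup each expert handles exactly half of all tokens, so the load is balanced at the corpus level. The micro-batch variant still reports a loss of 1.5 because each micro-batch is unbalanced on its own, while the global-batch variant reports 1.0, the same value a uniform router would give, so specialization by domain is not penalized.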

Authors:Jingjing Tang, Erica Cooper, Xin Wang, Junichi Yamagishi, George Fazekas
Title: Towards An Integrated Approach for Expressive Piano Performance Synthesis from Music Scores
Abstract:
This paper presents an integrated system that transforms symbolic music scores into expressive piano performance audio. By combining a Transformer-based Expressive Performance Rendering (EPR) model with a fine-tuned neural MIDI synthesiser, our approach directly generates expressive audio performances from score inputs. To the best of our knowledge, this is the first system to offer a streamlined method for converting score MIDI files lacking expression control into rich, expressive piano performances. We conducted experiments using subsets of the ATEPP dataset, evaluating the system with both objective metrics and subjective listening tests. Our system not only accurately reconstructs human-like expressiveness, but also captures the acoustic ambience of environments such as concert halls and recording studios. Additionally, the proposed system demonstrates its ability to achieve musical expressiveness while ensuring good audio quality in its outputs.
Chinese: 本文提出了一种创新系统,通过结合基于Transformer的表演渲染模型与神经MIDI合成器,将符号乐谱转换为富有表现力的钢琴音频,能准确还原人性化演奏效果并捕捉环境声学特征。
English: This paper introduces a pioneering system that converts symbolic music scores into expressive piano audio by integrating a Transformer-based performance rendering model with a neural MIDI synthesizer, producing realistic performances with ambient acoustics.

Authors:Chen Zhang, Xinyi Dai, Yaxiong Wu, Qu Yang, Yasheng Wang, Ruiming Tang, Yong Liu
Title: A Survey on Multi-Turn Interaction Capabilities of Large Language Models
Abstract:
Multi-turn interaction in the dialogue system research refers to a system's ability to maintain context across multiple dialogue turns, enabling it to generate coherent and contextually relevant responses. Recent advancements in large language models (LLMs) have significantly expanded the scope of multi-turn interaction, moving beyond chatbots to enable more dynamic agentic interactions with users or environments. In this paper, we provide a focused review of the multi-turn capabilities of LLMs, which are critical for a wide range of downstream applications, including conversational search and recommendation, consultation services, and interactive tutoring. This survey explores four key aspects: (1) the core model capabilities that contribute to effective multi-turn interaction, (2) how multi-turn interaction is evaluated in current practice, (3) the general algorithms used to enhance multi-turn interaction, and (4) potential future directions for research in this field.
Chinese: 本文综述了大语言模型如何通过保持多轮对话的上下文连贯性来增强对话系统,重点探讨了其核心能力、评估方法、优化算法及未来研究方向。
English: This paper reviews how large language models enhance multi-turn dialogue systems by maintaining contextual coherence across interactions, examining their core capabilities, evaluation methods, improvement algorithms, and future research directions.

Authors:Wulian Yun, Mengshi Qi, Fei Peng, Huadong Ma
Title: A New Teacher-Reviewer-Student Framework for Semi-supervised 2D Human Pose Estimation
Abstract:
Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate the above problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method by utilizing a newly designed Teacher-Reviewer-Student framework. Specifically, we first mimic the phenomenon that human beings constantly review previous knowledge for consolidation to design our framework, in which the teacher predicts results to guide the student's learning and the reviewer stores important historical parameters to provide additional supervision signals. Secondly, we introduce a Multi-level Feature Learning strategy, which utilizes the outputs from different stages of the backbone to estimate the heatmap to guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, i.e., Keypoint-Mix, to perturb pose information by mixing different keypoints, thus enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets demonstrate that our method achieves significant improvements compared to the existing methods.
Chinese: 本文提出了一种新颖的半监督二维人体姿态估计方法,采用教师-评审者-学生框架,结合多层次特征学习和关键点混合数据增强策略,在公开数据集上显著优于现有方法。
English: This paper introduces a novel semi-supervised 2D human pose estimation method using a Teacher-Reviewer-Student framework, enhanced by multi-level feature learning and Keypoint-Mix data augmentation, which significantly outperforms existing approaches on public datasets.

Authors:Yichen Li, Yuying Wang, Jiahua Dong, Haozhao Wang, Yining Qi, Rui Zhang, Ruixuan Li
Title: Resource-Constrained Federated Continual Learning: What Does Matter?
Abstract:
Federated Continual Learning (FCL) aims to enable sequentially privacy-preserving model training on streams of incoming data that vary in edge devices by preserving previous knowledge while adapting to new data. Current FCL literature focuses on restricted data privacy and access to previously seen data while imposing no constraints on the training overhead. This is unreasonable for FCL applications in real-world scenarios, where edge devices are primarily constrained by resources such as storage, computational budget, and label rate. We revisit this problem with a large-scale benchmark and analyze the performance of state-of-the-art FCL approaches under different resource-constrained settings. Various typical FCL techniques and six datasets in two incremental learning scenarios (Class-IL and Domain-IL) are involved in our experiments. Through extensive experiments amounting to a total of more than 1,000 GPU hours, we find that, under resource-constrained settings, existing FCL approaches, without exception, fail to achieve the expected performance. Our conclusions are consistent in the sensitivity analysis. This suggests that most existing FCL methods are too resource-dependent for real-world deployment. Moreover, we study the performance of typical FCL techniques with resource constraints and shed light on future research directions in FCL.
Chinese: 联邦持续学习在现实部署中面临挑战,因其对资源依赖过高,现有方法在存储、计算和标注率受限时均无法达到预期性能。
English: Federated Continual Learning faces challenges in real-world deployment due to its high resource dependency, as current methods fail to achieve expected performance under constrained settings like limited storage, computation, and label rates.

Authors:Jiaxin Guo, Yuanchang Luo, Daimeng Wei, Ling Zhang, Zongyao Li, Hengchao Shang, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Zhanglin Wu, Hao Yang
Title: Doc-Guided Sent2Sent++: A Sent2Sent++ Agent with Doc-Guided memory for Document-level Machine Translation
Abstract:
The field of artificial intelligence has witnessed significant advancements in natural language processing, largely attributed to the capabilities of Large Language Models (LLMs). These models form the backbone of Agents designed to address long-context dependencies, particularly in Document-level Machine Translation (DocMT). DocMT presents unique challenges, with quality, consistency, and fluency being the key metrics for evaluation. Existing approaches, such as Doc2Doc and Doc2Sent, either omit sentences or compromise fluency. This paper introduces Doc-Guided Sent2Sent++, an Agent that employs an incremental sentence-level forced decoding strategy \textbf{to ensure every sentence is translated while enhancing the fluency of adjacent sentences.} Our Agent leverages a Doc-Guided Memory, focusing solely on the summary and its translation, which we find to be an efficient approach to maintaining consistency. Through extensive testing across multiple languages and domains, we demonstrate that Sent2Sent++ outperforms other methods in terms of quality, consistency, and fluency. The results indicate that our approach achieves significant improvements in metrics such as s-COMET, d-COMET, LTCR-$1_f$, and document-level perplexity (d-ppl). The contributions of this paper include a detailed analysis of current DocMT research, the introduction of the Sent2Sent++ decoding method, the Doc-Guided Memory mechanism, and validation of its effectiveness across languages and domains.
Chinese: 本文提出的Doc-Guided Sent2Sent++代理通过增量式强制解码和文档引导记忆机制,确保所有句子被翻译并提升相邻句子的流畅性,在多种语言和领域中质量、一致性和流畅性均优于现有方法。
English: This paper introduces Doc-Guided Sent2Sent++, an agent that improves document-level machine translation by ensuring all sentences are translated and enhancing fluency through incremental forced decoding and a memory mechanism, outperforming existing methods in quality, consistency, and fluency across multiple languages and domains.

Authors:Grik Tadevosyan, Maksim Osipenko, Demetros Aschu, Aleksey Fedoseev, Valerii Serpiva, Oleg Sautenkov, Sausar Karaf, Dzmitry Tsetserukou
Title: SafeSwarm: Decentralized Safe RL for the Swarm of Drones Landing in Dense Crowds
Abstract:
This paper introduces a safe swarm of drones capable of performing landings in crowded environments robustly by relying on Reinforcement Learning techniques combined with Safe Learning. The developed system allows us to teach the swarm of drones with different dynamics to land on moving landing pads in an environment while avoiding collisions with obstacles and between agents. The safe barrier net algorithm was developed and evaluated using a swarm of Crazyflie 2.1 micro quadrotors, which were tested indoors with the Vicon motion capture system to ensure precise localization and control. Experimental results show that our system achieves landing accuracy of 2.25 cm with a mean time of 17 s and collision-free landings, underscoring its effectiveness and robustness in real-world scenarios. This work offers a promising foundation for applications in environments where safety and precision are paramount.
中文: 本文提出一种基于强化学习与安全学习的安全无人机集群系统,能在移动平台上实现无碰撞降落,精度达2.25厘米且耗时17秒,展现了优异的实际应用性能。
English: This paper presents a safe drone swarm system using Reinforcement Learning and Safe Learning to achieve collision-free landings on moving platforms with 2.25 cm accuracy in 17 seconds, demonstrating robust real-world performance.

Authors:Siyu Yan, Tiancheng Liu, Weikai Yang, Nan Tang, Yuyu Luo
Title: ChartEditor: A Human-AI Paired Tool for Authoring Pictorial Charts
Abstract:
Pictorial charts are favored for their memorability and visual appeal, offering a more engaging alternative to basic charts. However, their creation can be complex and time-consuming due to the lack of native support in popular visualization tools like Tableau. While AI-generated content (AIGC) tools have lowered the barrier to creating pictorial charts, they often lack precise design control. To address this issue, we introduce ChartEditor, a human-AI paired tool that transforms basic charts into pictorial versions based on user intent. ChartEditor decomposes chart images into visual components and organizes them within a hierarchical tree. Based on this tree, users can express their intent in natural language, which is then translated into modifications to the hierarchy. In addition, users can directly interact with and modify specific chart components via an intuitive interface to achieve fine-grained design control. A user study demonstrates the effectiveness and usability of ChartEditor in simplifying the creation of pictorial charts.
中文摘要:ChartEditor是一款人机协作工具,通过解析自然语言指令并支持直接组件编辑,将基础图表转化为图示化版本,显著简化了图示图表的制作流程。
English Summary: ChartEditor is a human-AI collaboration tool that converts standard charts into pictorial versions by interpreting natural language commands and enabling direct component manipulation, effectively simplifying their creation process.

Authors:Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Title: The Lessons of Developing Process Reward Models in Mathematical Reasoning
Abstract:
Process Reward Models (PRMs) have emerged as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provide practical guidelines for future research in building process supervision models.
中文: 过程奖励模型在数据合成和评估方面面临挑战,但通过整合蒙特卡洛估计与大语言模型评判的新共识过滤机制显著提升了性能,最终开发出用于数学推理的顶尖过程监督模型。
English: Process Reward Models (PRMs) face challenges in data synthesis and evaluation, but a new consensus filtering mechanism integrating Monte Carlo estimation with LLM-as-a-judge significantly improves performance, resulting in a state-of-the-art PRM for mathematical reasoning.
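
The consensus filtering idea described in the abstract above can be illustrated with a minimal sketch. It assumes that "consensus" means keeping only those training solutions for which MC estimation and the LLM judge flag the same first erroneous step; that reading, the `LabeledSolution` record, and its field names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabeledSolution:
    """One candidate solution with two independent step-level labels (hypothetical schema)."""
    solution_id: str
    mc_first_error: Optional[int]     # first wrong step per MC estimation, None if none found
    judge_first_error: Optional[int]  # first wrong step per the LLM judge, None if none found


def consensus_filter(samples: List[LabeledSolution]) -> List[LabeledSolution]:
    """Keep only solutions on which both annotators agree (assumed agreement rule)."""
    return [s for s in samples if s.mc_first_error == s.judge_first_error]
```

Under this assumption, a solution that MC estimation accepts but the judge rejects at step 3 would be dropped, while one that both reject at step 3 would be kept as training data.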

Authors:Jiaxuan Peng, Mengshi Qi, Dong Zhao, Huadong Ma
Title: Towards Balanced Continual Multi-Modal Learning in Human Pose Estimation
Abstract:
3D human pose estimation (3D HPE) has emerged as a prominent research topic, particularly in the realm of RGB-based methods. However, RGB images are susceptible to limitations such as sensitivity to lighting conditions and potential user discomfort. Consequently, multi-modal sensing, which leverages non-intrusive sensors, is gaining increasing attention. Nevertheless, multi-modal 3D HPE still faces challenges, including modality imbalance and the imperative for continual learning. In this work, we introduce a novel balanced continual multi-modal learning method for 3D HPE, which harnesses the power of RGB, LiDAR, mmWave, and WiFi. Specifically, we propose a Shapley value-based contribution algorithm to quantify the contribution of each modality and identify modality imbalance. To address this imbalance, we employ a re-learning strategy. Furthermore, recognizing that raw data is prone to noise contamination, we develop a novel denoising continual learning approach. This approach incorporates a noise identification and separation module to mitigate the adverse effects of noise and collaborates with the balanced learning strategy to enhance optimization. Additionally, an adaptive EWC mechanism is employed to alleviate catastrophic forgetting. We conduct extensive experiments on the widely-adopted multi-modal dataset, MM-Fi, which demonstrate the superiority of our approach in boosting 3D pose estimation and mitigating catastrophic forgetting in complex scenarios. We will release our codes.
中文: 本研究提出了一种平衡的持续多模态学习方法,通过量化模态贡献、重新学习策略和自适应机制解决3D人体姿态估计中的模态不平衡和噪声问题,从而提升性能并减少遗忘。
English: This study introduces a balanced continual multi-modal learning method for 3D human pose estimation, addressing modality imbalance and noise through contribution quantification, re-learning strategies, and adaptive mechanisms to enhance performance and reduce forgetting.
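
To make the Shapley value-based contribution idea above concrete, here is a small sketch that computes exact Shapley values for the four modalities named in the abstract. The value function `toy_value` and the scores in `SOLO` are invented for illustration only; the paper's actual contribution algorithm and performance measure are not specified in this record.

```python
from itertools import permutations

# Hypothetical stand-alone scores per modality (illustrative numbers, not from the paper).
SOLO = {"RGB": 0.5, "LiDAR": 0.4, "mmWave": 0.2, "WiFi": 0.2}


def toy_value(coalition):
    """Toy performance of a modality subset: best single modality plus a small fusion bonus."""
    if not coalition:
        return 0.0
    return max(SOLO[m] for m in coalition) + 0.05 * (len(coalition) - 1)


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    players = list(players)
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}
```

Calling `shapley_values(SOLO, toy_value)` returns one score per modality; the scores sum to the value of the full modality set, and a modality with a much lower share than the others would be a candidate for the re-learning step.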

Authors:Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu
Title: Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Abstract:
Existing research on video understanding still struggles to achieve in-depth comprehension and reasoning in complex videos, primarily due to the under-exploration of two key bottlenecks: fine-grained spatial-temporal perceptive understanding and cognitive-level video scene comprehension. This paper bridges the gap by presenting a novel solution. We first introduce a novel video Multimodal Large Language Model (MLLM), MotionEpic, which achieves fine-grained pixel-level spatial-temporal video grounding by integrating video spatial-temporal scene graph (STSG) representation. Building upon MotionEpic, we then develop a Video-of-Thought (VoT) reasoning framework. VoT inherits the Chain-of-Thought (CoT) core, breaking down a complex task into simpler and manageable sub-problems, and addressing them step-by-step from low-level pixel perception to high-level cognitive interpretation. Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly improves on the existing state of the art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios. Project is open at https://haofei.vip/VoT
中文摘要:本文提出MotionEpic视频多模态大语言模型实现细粒度时空定位,并基于此构建视频思维链推理框架,通过从低级像素感知到高级认知理解的逐步推理,显著提升了复杂视频的理解与推理能力。
English Summary: This paper introduces MotionEpic, a video Multimodal Large Language Model that achieves fine-grained spatial-temporal video grounding, and builds upon it with a Video-of-Thought reasoning framework to enable step-by-step cognitive interpretation, significantly advancing complex video understanding and reasoning capabilities.

Authors:Hongyi Miao, Jun Jia, Yankun Cao, Yingjie Zhou, Yanwei Jiang, Zhi Liu, Guangtao Zhai
Title: Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?
Abstract:
With the dramatic upsurge in the volume of ultrasound examinations, low-quality ultrasound imaging has gradually increased due to variations in operator proficiency and imaging circumstances, imposing a severe burden on diagnosis accuracy and even entailing the risk of restarting the diagnosis in critical cases. To assist clinicians in selecting high-quality ultrasound images and ensuring accurate diagnoses, we introduce Ultrasound-QBench, a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) on quality assessment tasks of ultrasound images. Ultrasound-QBench establishes two datasets collected from diverse sources: IVUSQA, consisting of 7,709 images, and CardiacUltraQA, containing 3,863 images. These images, which encompass common ultrasound imaging artifacts, are annotated by professional ultrasound experts and classified into three quality levels: high, medium, and low. To better evaluate MLLMs, we decompose the quality assessment task into three dimensions: qualitative classification, quantitative scoring, and comparative assessment. The evaluation of 7 open-source MLLMs as well as 1 proprietary MLLM demonstrates that MLLMs possess preliminary capabilities for low-level visual tasks in ultrasound image quality classification. We hope this benchmark will inspire the research community to delve deeper into uncovering and enhancing the untapped potential of MLLMs for medical imaging tasks.
中文摘要:Ultrasound-QBench是一个综合性基准,旨在通过评估多模态大语言模型在超声图像质量分类、评分和比较任务中的表现,帮助临床医生筛选高质量图像以确保诊断准确性,并推动医学影像领域的深入研究。
English Summary: Ultrasound-QBench is a benchmark introduced to evaluate multimodal large language models on ultrasound image quality assessment, aiming to assist clinicians in selecting high-quality images for accurate diagnoses by testing models on datasets with annotated quality levels.

Authors:Suizhi Huang, Xingyi Yang, Hongtao Lu, Xinchao Wang
Title: Few-shot Implicit Function Generation via Equivariance
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful framework for representing continuous signals. However, generating diverse INR weights remains challenging due to limited training data. We introduce Few-shot Implicit Function Generation, a new problem setup that aims to generate diverse yet functionally consistent INR weights from only a few examples. This is challenging because even for the same signal, the optimal INRs can vary significantly depending on their initializations. To tackle this, we propose EquiGen, a framework that can generate new INRs from limited data. The core idea is that functionally similar networks can be transformed into one another through weight permutations, forming an equivariance group. By projecting these weights into an equivariant latent space, we enable diverse generation within these groups, even with few examples. EquiGen implements this through an equivariant encoder trained via contrastive learning and smooth augmentation, an equivariance-guided diffusion process, and controlled perturbations in the equivariant subspace. Experiments on 2D image and 3D shape INR datasets demonstrate that our approach effectively generates diverse INR weights while preserving their functional properties in few-shot scenarios.
Chinese: 本文提出EquiGen框架,通过利用权重置换等变性和等变潜在空间,从少量样本中生成多样化的隐式神经表示权重,并在2D和3D任务中验证了其有效性。
English: The paper introduces EquiGen, a framework that generates diverse Implicit Neural Representation (INR) weights from few examples by leveraging weight permutation equivariance and an equivariant latent space, demonstrating effectiveness in 2D and 3D tasks.
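
The weight-permutation symmetry that the abstract above builds on can be checked directly on a tiny network. The sketch below is not EquiGen; it only shows, for an assumed two-layer MLP with a sine activation (a common INR choice, assumed here), that reordering hidden units changes the weights but not the represented function. All function names are hypothetical.

```python
import math
import random


def random_mlp(seed, n_in=2, n_hidden=4, n_out=1):
    """Random weights for a tiny two-layer MLP (illustrative sizes)."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
    return W1, b1, W2, b2


def mlp(x, W1, b1, W2, b2):
    """Forward pass: y = W2 * sin(W1 * x + b1) + b2."""
    hidden = [math.sin(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and b1, and the matching columns of W2."""
    W1p = [W1[j] for j in perm]
    b1p = [b1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, b1p, W2p
```

Evaluating `mlp` with the original and the permuted weights at the same input gives the same output up to floating-point summation order, which is why many distinct weight vectors can encode one signal.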

Authors:Cleverson Nahum, Salvatore D'Oro, Pedro Batista, Cristiano Both, Kleber Cardoso, Aldebaro Klautau, Tommaso Melodia
Title: Intent-based Radio Scheduler for RAN Slicing: Learning to deal with different network scenarios
Abstract:
The future mobile network has the complex mission of distributing available radio resources among various applications with different requirements. Radio access network slicing enables the creation of different logical networks by isolating and using dedicated resources for each group of applications. In this scenario, radio resource scheduling (RRS) is responsible for distributing the available radio resources among the slices to fulfill their service-level agreement (SLA) requirements, prioritizing critical slices while minimizing the number of intent violations. Moreover, ensuring that the RRS can deal with a high diversity of network scenarios is essential. Several recent papers present advances in machine learning-based RRS. However, the scenarios and slice variety are restricted, which inhibits solid conclusions about the generalization capabilities of the models after deployment in real networks. This paper proposes an intent-based RRS using multi-agent reinforcement learning in a radio access network (RAN) slicing context. The proposed method protects high-priority slices when the available radio resources cannot fulfill all the slices. It uses transfer learning to reduce the number of training steps required. The proposed method and baselines are evaluated in different network scenarios that comprise combinations of different slice types, channel trajectories, numbers of active slices and user equipment (UEs), and UE characteristics. The proposed method outperformed the baselines in protecting slices with higher priority, obtaining an improvement of 40% and, when considering all the slices, obtaining an improvement of 20% in relation to the baselines. The results show that by using transfer learning, the required number of training steps could be reduced by a factor of eight without hurting performance.
中文摘要:本文提出了一种基于意图的多智能体强化学习无线资源调度方法,在无线接入网络切片场景中优先保障高优先级切片,相比基线方法对关键切片的保护效果提升40%,并通过迁移学习将训练步骤减少八倍且不影响性能。
English Summary: This paper introduces an intent-based radio resource scheduling method using multi-agent reinforcement learning to prioritize high-priority network slices, achieving 40% better protection for critical slices and reducing the required training steps by a factor of eight through transfer learning.

Authors:Chaoqun Liang, Thomas Benz, Alessandro Ottaviano, Angelo Garofalo, Luca Benini, Davide Rossi
Title: Towards Reliable Systems: A Scalable Approach to AXI4 Transaction Monitoring
Abstract:
In safety-critical SoC applications such as automotive and aerospace, reliable transaction monitoring is crucial for maintaining system integrity. This paper introduces a drop-in Transaction Monitoring Unit (TMU) for AXI4 subordinate endpoints that detects transaction failures including protocol violations or timeouts and triggers recovery by resetting the affected subordinates. Two TMU variants address different constraints: a Tiny-Counter solution for tightly area-constrained systems and a Full-Counter solution for critical subordinates in mixed-criticality SoCs. The Tiny-Counter employs a single counter per outstanding transaction, while the Full-Counter uses multiple counters to track distinct transaction stages, offering finer-grained monitoring and reducing detection latencies by up to hundreds of cycles at roughly 2.5x the area cost. The Full-Counter also provides detailed error logs for performance and bottleneck analysis. Evaluations at both IP and system levels confirm the TMU's effectiveness and low overhead. In GF12 technology, monitoring 16-32 outstanding transactions occupies 1330-2616 um2 for the Tiny-Counter and 3452-6787 um2 for the Full-Counter; moderate prescaler steps reduce these figures by 18-39% and 19-32%, respectively, with no loss of functionality. Results from a full-system integration demonstrate the TMU's robust and precise monitoring capabilities in safety-critical SoC environments.
中文: 本文提出用于AXI4系统的交易监控单元,能检测交易失败并触发恢复,提供精简计数器与完整计数器两种方案,在安全关键SoC中实现面积限制与监控精度的平衡。
English: This paper presents a Transaction Monitoring Unit (TMU) for AXI4 systems that detects transaction failures and triggers recovery, offering Tiny-Counter and Full-Counter variants to balance area constraints with monitoring precision in safety-critical SoCs.
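
As a quick check of the "roughly 2.5x" area figure quoted above, the ratio can be recomputed from the reported GF12 numbers. The sketch assumes that the lower end of each area range corresponds to 16 outstanding transactions and the upper end to 32; the abstract gives only the ranges, so that pairing is an assumption.

```python
# Areas in um^2 from the abstract; the pairing of range endpoints with
# 16 and 32 outstanding transactions is assumed, not stated explicitly.
TINY_COUNTER_UM2 = {16: 1330, 32: 2616}
FULL_COUNTER_UM2 = {16: 3452, 32: 6787}


def area_ratio(outstanding: int) -> float:
    """Full-Counter area divided by Tiny-Counter area at the same transaction count."""
    return FULL_COUNTER_UM2[outstanding] / TINY_COUNTER_UM2[outstanding]


for n in (16, 32):
    print(f"{n} outstanding transactions: {area_ratio(n):.2f}x")
```

Under that pairing both ratios come out just under 2.6, consistent with the abstract's rounded figure.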

Authors:Shengyao Zhuang, Ekaterina Khramtsova, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Title: Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks
Abstract:
Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods. In this study, we propose three pixel poisoning attack methods designed to compromise VLM-based retrievers and evaluate their effectiveness under various attack settings and parameter configurations. Our empirical results demonstrate that injecting even a single adversarial screenshot into the retrieval corpus can significantly disrupt search results, poisoning the top-10 retrieved documents for 41.9% of queries in the case of DSE and 26.4% for ColPali. These vulnerability rates notably exceed those observed with equivalent attacks on text-only retrievers. Moreover, when targeting a small set of known queries, the attack success rate rises, achieving complete success in certain cases. By exposing the vulnerabilities inherent in vision-language models, this work highlights the potential risks associated with their deployment.
Chinese: 近期基于视觉语言模型的密集检索系统易受像素投毒攻击,注入对抗性截图可显著破坏搜索结果,特定模型攻击成功率高达41.9%,远超传统文本检索器。
English: Dense retrievers built on vision-language models are vulnerable to pixel poisoning attacks: a single adversarial screenshot injected into the corpus poisons the top-10 results for 41.9% of queries on DSE and 26.4% on ColPali, rates that exceed those of equivalent attacks on text-only retrievers.

Authors:Marco Huber, Fadi Boutros, Naser Damer
Title: Frequency Matters: Explaining Biases of Face Recognition in the Frequency Domain
Abstract:
Face recognition (FR) models are vulnerable to performance variations across demographic groups. The causes for these performance differences are unclear due to the highly complex deep learning-based structure of face recognition models. Several works aimed at exploring possible roots of gender and ethnicity bias, identifying semantic reasons such as hairstyle, make-up, or facial hair as possible sources. Motivated by recent discoveries of the importance of frequency patterns in convolutional neural networks, we explain bias in face recognition using state-of-the-art frequency-based explanations. Our extensive results show that different frequencies are important to FR models depending on the ethnicity of the samples.
Chinese: 本研究通过先进的频域分析方法揭示,人脸识别模型在不同种族样本中依赖不同的频率模式,这解释了模型性能在人口统计学群体间存在差异的原因。
English: Face recognition models exhibit performance disparities across demographic groups, which this study attributes to varying reliance on frequency patterns depending on ethnicity, as revealed through advanced frequency-based analysis.

Authors:Changze Lv, Yansen Wang, Dongqi Han, Yifei Shen, Xiaoqing Zheng, Xuanjing Huang, Dongsheng Li
Title: Toward Relative Positional Encoding in Spiking Transformers
Abstract:
Spiking neural networks (SNNs) are bio-inspired networks that mimic how neurons in the brain communicate through discrete spikes, which have great potential in various tasks due to their energy efficiency and temporal processing capabilities. SNNs with self-attention mechanisms (spiking Transformers) have recently shown great advancements in various tasks, and inspired by traditional Transformers, several studies have demonstrated that spiking absolute positional encoding can help capture sequential relationships for input data, enhancing the capabilities of spiking Transformers for tasks such as sequential modeling and image classification. However, how to incorporate relative positional information into SNNs remains a challenge. In this paper, we introduce several strategies to approximate relative positional encoding (RPE) in spiking Transformers while preserving the binary nature of spikes. Firstly, we formally prove that encoding relative distances with Gray Code ensures that the binary representations of positional indices maintain a constant Hamming distance whenever their decimal values differ by a power of two, and we propose Gray-PE based on this property. In addition, we propose another RPE method called Log-PE, which combines the logarithmic form of the relative distance matrix directly into the spiking attention map. Furthermore, we extend our RPE methods to a two-dimensional form, making them suitable for processing image patches. We evaluate our RPE methods on various tasks, including time series forecasting, text classification, and patch-based image classification, and the experimental results demonstrate a satisfying performance gain by incorporating our RPE methods across many architectures.
Chinese: 本文提出了在保持脉冲二进制特性的前提下,在脉冲Transformer中实现相对位置编码的新策略,包括基于格雷码的Gray-PE和对数形式的Log-PE方法,这些方法在多项任务中均展现出显著的性能提升。
English: This paper introduces novel strategies for approximating relative positional encoding in spiking Transformers while maintaining binary spike properties, including Gray-PE and Log-PE methods that demonstrate significant performance improvements across various tasks.
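
The Gray code property stated in the abstract above can be verified numerically. The sketch below uses the standard reflected binary Gray code, `n ^ (n >> 1)`; whether Gray-PE uses exactly this variant is an assumption, and the helper names are hypothetical.

```python
def gray(n: int) -> int:
    """Standard reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def distances_for_offset(offset: int, max_index: int) -> set:
    """All Hamming distances between Gray codes of positions i and i + offset."""
    return {hamming(gray(i), gray(i + offset)) for i in range(max_index)}


for k in range(5):
    print(2 ** k, distances_for_offset(2 ** k, 256))
```

For this code the set of distances collapses to a single value whenever the offset is a power of two (1 for an offset of 1, and 2 for offsets 2, 4, 8, ...), whereas an offset such as 3 yields several different distances.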

Authors:Peilin Yu, Yuwei Wu, Zhi Gao, Xiaomeng Fan, Yunde Jia
Title: Large-Scale Riemannian Meta-Optimization via Subspace Adaptation
Abstract:
Riemannian meta-optimization provides a promising approach to solving non-linear constrained optimization problems, which trains neural networks as optimizers to perform optimization on Riemannian manifolds. However, existing Riemannian meta-optimization methods take up huge memory footprints in large-scale optimization settings, as the learned optimizer can only adapt gradients of a fixed size and thus cannot be shared across different Riemannian parameters. In this paper, we propose an efficient Riemannian meta-optimization method that significantly reduces the memory burden for large-scale optimization via a subspace adaptation scheme. Our method trains neural networks to individually adapt the row and column subspaces of Riemannian gradients, instead of directly adapting the full gradient matrices in existing Riemannian meta-optimization methods. In this case, our learned optimizer can be shared across Riemannian parameters with different sizes. Our method reduces the model memory consumption by six orders of magnitude when optimizing an orthogonal mainstream deep neural network (e.g., ResNet50). Experiments on multiple Riemannian tasks show that our method can not only reduce the memory consumption but also improve the performance of Riemannian meta-optimization.
中文: 本文提出一种高效的黎曼元优化方法,通过适应梯度子空间而非完整矩阵来大幅降低内存消耗,使优化器可在不同参数间共享,并在显著减少内存占用的同时提升性能。
English: This paper introduces an efficient Riemannian meta-optimization method that reduces memory usage by adapting gradient subspaces instead of full matrices, enabling optimizer sharing across parameters and achieving significant memory savings with improved performance.

Authors:Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy
Title: MatAnyone: Stable Video Matting with Consistent Memory Propagation
Abstract:
Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
中文摘要:MatAnyone是一种面向目标指定的鲁棒视频抠图框架,通过区域自适应记忆融合确保核心区域语义稳定性并保留边界细节,结合新数据集和训练策略在多样化场景中实现卓越性能。
English Summary: MatAnyone is a robust target-assigned video matting framework that ensures semantic stability in core regions and preserves fine boundary details through region-adaptive memory fusion, supported by a new dataset and a training strategy that leverages large-scale segmentation data.

Authors:Saba Sadeghi Ahouei, Denis Antipov, Aneta Neumann, Frank Neumann
Title: Feature-based Evolutionary Diversity Optimization of Discriminating Instances for Chance-constrained Optimization Problems
Abstract:
Algorithm selection is crucial in the field of optimization, as no single algorithm performs perfectly across all types of optimization problems. Finding the best algorithm among a given set of algorithms for a given problem requires a detailed analysis of the problem's features. To do so, it is important to have a diverse set of benchmarking instances highlighting the difference in algorithms' performance. In this paper, we evolve diverse benchmarking instances for chance-constrained optimization problems that contain stochastic components characterized by their expected values and variances. These instances clearly differentiate the performance of two given algorithms, meaning they are easy to solve by one algorithm and hard to solve by the other. We introduce a $(μ+1)~EA$ for feature-based diversity optimization to evolve such differentiating instances. We study the chance-constrained maximum coverage problem with stochastic weights on the vertices as an example of chance-constrained optimization problems. The experimental results demonstrate that our method successfully generates diverse instances based on different features while effectively distinguishing the performance between a pair of algorithms.
中文: 本文提出了一种基于$(μ+1)$ EA的特征多样性优化方法,用于生成多样化的基准测试实例,这些实例能有效区分算法在机会约束优化问题中的性能表现。
English: This paper introduces a method using a $(μ+1)$ EA for feature-based diversity optimization to evolve diverse benchmarking instances that effectively differentiate the performance of algorithms in chance-constrained optimization problems.
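
A generic (μ+1) EA for feature-based diversity, as named in the abstract above, can be sketched as follows. This is a toy version under stated assumptions: instances are lists of weights in [0, 1], the single feature is the mean weight, diversity is the sum of pairwise feature gaps, and the `accept` predicate stands in for the paper's requirement that an instance discriminates between two algorithms. None of these choices come from the paper.

```python
import random
from itertools import combinations


def feature(instance):
    """Toy feature of an instance: its mean weight (assumed for illustration)."""
    return sum(instance) / len(instance)


def diversity(population):
    """Sum of pairwise feature gaps; larger means a more spread-out population."""
    return sum(abs(feature(a) - feature(b)) for a, b in combinations(population, 2))


def mu_plus_one_ea(population, steps, rng, accept=lambda inst: True):
    """Keep mu individuals; each step adds one mutant, then drops the least useful one."""
    population = [list(p) for p in population]
    for _ in range(steps):
        child = list(rng.choice(population))
        i = rng.randrange(len(child))
        # Gaussian mutation of one weight, clipped to [0, 1].
        child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.1)))
        if not accept(child):  # placeholder for the quality (discrimination) constraint
            continue
        candidates = population + [child]
        # Remove the individual whose removal leaves the most diverse population.
        drop = max(range(len(candidates)),
                   key=lambda k: diversity(candidates[:k] + candidates[k + 1:]))
        population = candidates[:drop] + candidates[drop + 1:]
    return population
```

Because dropping the new child is always one of the options, the population's diversity never decreases from one step to the next in this sketch.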

Authors:Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, Mobeen Mahmood, Oleksandr Pokutnyi, Oleg Iskra, Jessica P. Wang, John-Clark Levin, Mstyslav Kazakov, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Serguei Popov, Robert Gerbicz, Geoff Galgon, Johannes Schmitt, Will Yeadon, Yongki Lee, Scott Sauers, Alvaro Sanchez, Fabian Giska, Marc Roth, Søren Riis, Saiteja Utpala, Noah Burns, Gashaw M. Goshu, Mohinder Maheshbhai Naiya, Chidozie Agu, Zachary Giboney, Antrell Cheatom, Francesco Fournier-Facio, Sarah-Jane Crowson, Lennart Finke, Zerui Cheng, Jennifer Zampese, Ryan G. Hoerr, Mark Nandor, Hyunwoo Park, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Alexis C Garretson, Edwin Taylor, Damien Sileo, Qiuyu Ren, Usman Qazi, Lianghui Li, Jungbae Nam, John B. Wydallis, Pavel Arkhipov, Jack Wei Lun Shi, Aras Bacho, Chris G. Willcocks, Hangrui Cao, Sumeet Motwani, Emily de Oliveira Santos, Johannes Veith, Edward Vendrow, Doru Cojoc, Kengo Zenitani, Joshua Robinson, Longke Tang, Yuqi Li, Joshua Vendrow, Natanael Wildner Fraga, Vladyslav Kuchkin, Andrey Pupasov Maksimov, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Aleksandar Mikov, Andrew Gritsevskiy, Julien Guillod, Gözdenur Demir, Dakotah Martinez, Ben Pageler, Kevin Zhou, Saeed Soori, Ori Press, Henry Tang, Paolo Rissone, Sean R. 
Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, Joseph Marvin Imperial, Ameya Prabhu, Jinzhou Yang, Nick Crispino, Arun Rao, Dimitri Zvonkine, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Nate Stambaugh, Subrata Mishra, Tad Hogg, Carlo Bosio, Brian P Coppola, Julian Salazar, Jaehyeok Jin, Rafael Sayous, Stefan Ivanov, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Andres Algaba, Kelsey Van den Houte, Lynn Van Der Sypt, Brecht Verbeken, David Noever, Alexei Kopylov, Benjamin Myklebust, Bikun Li, Lisa Schut, Evgenii Zheltonozhskii, Qiaochu Yuan, Derek Lim, Richard Stanley, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Anmol Sahu, Cesare Giulio Ardito, Yuzheng Hu, Ariel Ghislain Kemogne Kamdoum, Alvin Jin, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Gongbo Sun, Daniil S. Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Joseph M Cavanagh, Daofeng Li, Jiawei Shen, Donato Crisostomi, Wenjin Zhang, Ali Dehghan, Sergey Ivanov, David Perrella, Nurdin Kaparov, Allen Zang, Ilia Sucholutsky, Arina Kharlamova, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Shankar Sivarajan, Dan Bar Hava, Aleksey Kuchkin, David Holmes, Alexandra Rodriguez-Romero, Frank Sommerhage, Anji Zhang, Richard Moat, Keith Schneider, Zakayo Kazibwe, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Sara Fish, Veit Elser, Tobias Kreiman, Victor Efren Guadarrama Vilchis, Immo Klose, Ujjwala Anantheswaran, Adam Zweiger, Kaivalya Rawal, Jeffery Li, Jeremy Nguyen, Nicolas Daans, Haline Heidinger, Maksim Radionov, Václav Rozhoň, Vincent Ginis, Christian Stump, Niv Cohen, Rafał Poświata, Josef Tkadlec, Alan Goldfarb, Chenguang Wang, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Ryan Stendall, Jamie Tucker-Foltz, Jack Stade, T. 
Ryan Rogers, Tom Goertzen, Declan Grabb, Abhishek Shukla, Alan Givré, John Arnold Ambay, Archan Sen, Muhammad Fayez Aziz, Mark H Inlow, Hao He, Ling Zhang, Younesse Kaddar, Ivar Ängquist, Yanxu Chen, Harrison K Wang, Kalyan Ramakrishnan, Elliott Thornley, Antonio Terpin, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Martin Stehberger, Peter Bradshaw, JP Heimonen, Kaustubh Sridhar, Ido Akov, Jennifer Sandlin, Yury Makarychev, Joanna Tam, Hieu Hoang, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, David Aldous, Jesyin Lai, Shannon Coleman, Jiangnan Xu, Sangwon Lee, Ilias Magoulas, Sandy Zhao, Ning Tang, Michael K. Cohen, Orr Paradise, Jan Hendrik Kirchner, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Michael Wang, Yuzhou Nie, Anna Sztyber-Betley, Paolo Faraboschi, Robin Riblet, Jonathan Crozier, Shiv Halasyamani, Shreyas Verma, Prashant Joshi, Eli Meril, Ziqiao Ma, Jérémy Andréoletti, Raghav Singhal, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Alexander Ivanov, Seri Khoury, Nils Gustafsson, Marco Piccardo, Hamid Mostaghimi, Qijia Chen, Virendra Singh, Tran Quoc Khánh, Paul Rosu, Hannah Szlyk, Zachary Brown, Himanshu Narayan, Aline Menezes, Jonathan Roberts, William Alley, Kunyang Sun, Arkil Patel, Max Lamparth, Anka Reuel, Linwei Xin, Hanmeng Xu, Jacob Loader, Freddie Martin, Zixuan Wang, Andrea Achilleos, Thomas Preu, Tomek Korbak, Ida Bosio, Fereshteh Kazemi, Ziye Chen, Biró Bálint, Eve J. Y. Lo, Jiaqi Wang, Maria Inês S. Nunes, Jeremiah Milbauer, M Saiful Bari, Zihao Wang, Behzad Ansarinejad, Yewen Sun, Stephane Durand, Hossam Elgnainy, Guillaume Douville, Daniel Tordera, George Balabanian, Hew Wolff, Lynna Kvistad, Hsiaoyun Milliron, Ahmad Sakor, Murat Eron, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Sherwin Abdoli, Tim Santens, Shaul Barkan, Allison Tee, Robin Zhang, Alessandro Tomasiello, G. 
Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Jiayi Pan, Emma Rodman, Jacob Drori, Carl J Fossum, Niklas Muennighoff, Milind Jagota, Ronak Pradeep, Honglu Fan, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Moritz Firsching, Carter Harris, Stefan Ciobâcă, Jason Gross, Rohan Pandey, Ilya Gusev, Adam Jones, Shashank Agnihotri, Pavel Zhelnov, Mohammadreza Mofayezi, Alexander Piperski, David K. Zhang, Kostiantyn Dobarskyi, Roman Leventov, Ignat Soroko, Joshua Duersch, Vage Taamazyan, Andrew Ho, Wenjie Ma, William Held, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Katarzyna Olszewska, Claudio Di Fratta, Edson Oliveira, Joseph W. Jackson, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Bita Golshani, David Stap, Egor Kretov, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Nick Winter, Miguel Orbegozo Rodriguez, Robert Lauff, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Fortuna Samuele, Fredrik Ekström, Angela Hammon, Oam Patel, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Rayner Hernandez Perez, Daniel Pyda, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Kenchi Okutsu, Mike Battaglia, Mohammad Maghsoudimehrabani, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Handoko, Anton Peristyy, Stephen Malina, Mustafa Mehkary, Rami Aly, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Mukhwinder Singh, Hassan Shapourian, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Harsh Kumar, Chiara Ceconello, Chao Zhuang, Haon Park, Micah Carroll, Andrew R. Tawfeek, Stefan Steinerberger, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Jainam Shah, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. 
Pham, Kang Yong Loh, Joshua Robinson, Abram Jackson, Paolo Giordano, Philipp Petersen, Adrian Cosma, Jesus Colino, Colin White, Jacob Votava, Vladimir Vinnikov, Ethan Delaney, Petr Spelda, Vit Stritecky, Syed M. Shahid, Jean-Christophe Mourrat, Lavr Vetoshkin, Koen Sponselee, Renas Bacho, Zheng-Xin Yong, Florencia de la Rosa, Nathan Cho, Xiuyu Li, Guillaume Malod, Orion Weller, Guglielmo Albani, Leon Lang, Julien Laurendeau, Dmitry Kazakov, Fatimah Adesanya, Julien Portier, Lawrence Hollom, Victor Souza, Yuchen Anna Zhou, Julien Degorre, Yiğit Yalın, Gbenga Daniel Obikoya, Rai, Filippo Bigi, M. C. Boscá, Oleg Shumar, Kaniuar Bacho, Gabriel Recchia, Mara Popescu, Nikita Shulga, Ngefor Mildred Tanwie, Thomas C. H. Lux, Ben Rank, Colin Ni, Matthew Brooks, Alesia Yakimchyk, Huanxu, Liu, Stefano Cavalleri, Olle Häggström, Emil Verkama, Joshua Newbould, Hans Gundlach, Leonor Brito-Santana, Brian Amaro, Vivek Vajipey, Rynaa Grover, Ting Wang, Yosi Kratish, Wen-Ding Li, Sivakanth Gopi, Andrea Caciolai, Christian Schroeder de Witt, Pablo Hernández-Cámara, Emanuele Rodolà, Jules Robins, Dominic Williamson, Vincent Cheng, Brad Raynor, Hao Qi, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Christoph Demian, Peyman Kassani, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Alon Ragoler, Justin Tan, Blake Sims, Rebeka Plecnik, Aaron Kirtland, Omer Faruk Bodur, D. P. Shinde, Yan Carlos Leyva Labrador, Zahra Adoul, Mohamed Zekry, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Earth Anderson, Rodrigo De Oliveira Pena, Elizabeth Kelley, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, Ross Finocchio, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Isaac C. 
McAlister, Alejandro José Moyano, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Yana Malysheva, Daphiny Pottmaier, Omid Taheri, Stanley Stepanic, Samuel Perry, Luke Askew, Raúl Adrián Huerta Rodríguez, Ali M. R. Minissi, Ricardo Lorena, Krishnamurthy Iyer, Arshad Anil Fasiludeen, Ronald Clark, Josh Ducey, Matheus Piza, Maja Somrak, Eric Vergo, Juehang Qin, Benjámin Borbás, Eric Chu, Jack Lindsey, Antoine Jallon, I. M. J. McInnis, Evan Chen, Avi Semler, Luk Gloor, Tej Shah, Marc Carauleanu, Pascal Lauer, Tran Đuc Huy, Hossein Shahrtash, Emilien Duc, Lukas Lewark, Assaf Brown, Samuel Albanie, Brian Weber, Warren S. Vaz, Pierre Clavier, Yiyang Fan, Gabriel Poesia Reis e Silva, Long, Lian, Marcus Abramovitch, Xi Jiang, Sandra Mendoza, Murat Islam, Juan Gonzalez, Vasilios Mavroudis, Justin Xu, Pawan Kumar, Laxman Prasad Goswami, Daniel Bugas, Nasser Heydari, Ferenc Jeanplong, Thorben Jansen, Antonella Pinto, Archimedes Apronti, Abdallah Galal, Ng Ze-An, Ankit Singh, Tong Jiang, Joan of Arc Xavier, Kanu Priya Agarwal, Mohammed Berkani, Gang Zhang, Zhehang Du, Benedito Alves de Oliveira Junior, Dmitry Malishev, Nicolas Remy, Taylor D. 
Hartman, Tim Tarver, Stephen Mensah, Gautier Abou Loume, Wiktor Morak, Farzad Habibi, Sarah Hoback, Will Cai, Javier Gimenez, Roselynn Grace Montecillo, Jakub Łucki, Russell Campbell, Asankhaya Sharma, Khalida Meer, Shreen Gul, Daniel Espinosa Gonzalez, Xavier Alapont, Alex Hoover, Gunjan Chhablani, Freddie Vargus, Arunim Agarwal, Yibo Jiang, Deepakkumar Patil, David Outevsky, Kevin Joseph Scaria, Rajat Maheshwari, Abdelkader Dendane, Priti Shukla, Ashley Cartwright, Sergei Bogdanov, Niels Mündler, Sören Möller, Luca Arnaboldi, Kunvar Thaman, Muhammad Rehan Siddiqi, Prajvi Saxena, Himanshu Gupta, Tony Fruhauff, Glen Sherman, Mátyás Vincze, Siranut Usawasutsakorn, Dylan Ler, Anil Radhakrishnan, Innocent Enyekwe, Sk Md Salauddin, Jiang Muzhen, Aleksandr Maksapetyan, Vivien Rossbach, Chris Harjadi, Mohsen Bahaloohoreh, Claire Sparrow, Jasdeep Sidhu, Sam Ali, Song Bian, John Lai, Eric Singer, Justine Leon Uro, Greg Bateman, Mohamed Sayed, Ahmed Menshawy, Darling Duclosel, Dario Bezzi, Yashaswini Jain, Ashley Aaron, Murat Tiryakioglu, Sheeshram Siddh, Keith Krenek, Imad Ali Shah, Jun Jin, Scott Creighton, Denis Peskoff, Zienab EL-Wasif, Ragavendran P, Michael Richmond, Joseph McGowan, Tejal Patwardhan, Hao-Yu Sun, Ting Sun, Nikola Zubić, Samuele Sala, Stephen Ebert, Jean Kaddour, Manuel Schottdorf, Dianzhuo Wang, Gerol Petruzella, Alex Meiburg, Tilen Medved, Ali ElSheikh, S Ashwin Hebbar, Lorenzo Vaquero, Xianjun Yang, Jason Poulos, Vilém Zouhar, Sergey Bogdanik, Mingfang Zhang, Jorge Sanz-Ros, David Anugraha, Yinwei Dai, Anh N. Nhu, Xue Wang, Ali Anil Demircali, Zhibai Jia, Yuyin Zhou, Juncheng Wu, Mike He, Nitin Chandok, Aarush Sinha, Gaoxiang Luo, Long Le, Mickaël Noyé, Michał Perełkiewicz, Ioannis Pantidis, Tianbo Qi, Soham Sachin Purohit, Letitia Parcalabescu, Thai-Hoa Nguyen, Genta Indra Winata, Edoardo M. Ponti, Hanchen Li, Kaustubh Dhole, Jongee Park, Dario Abbondanza, Yuanli Wang, Anupam Nayak, Diogo M. Caetano, Antonio A. W. L. 
Wong, Maria del Rio-Chanona, Dániel Kondor, Pieter Francois, Ed Chalstrey, Jakob Zsambok, Dan Hoyer, Jenny Reddish, Jakob Hauser, Francisco-Javier Rodrigo-Ginés, Suchandra Datta, Maxwell Shepherd, Thom Kamphuis, Qizheng Zhang, Hyunjun Kim, Ruiji Sun, Jianzhu Yao, Franck Dernoncourt, Satyapriya Krishna, Sina Rismanchian, Bonan Pu, Francesco Pinto, Yingheng Wang, Kumar Shridhar, Kalon J. Overholt, Glib Briia, Hieu Nguyen, David, Soler Bartomeu, Tony CY Pang, Adam Wecker, Yifan Xiong, Fanfei Li, Lukas S. Huber, Joshua Jaeger, Romano De Maddalena, Xing Han Lù, Yuhui Zhang, Claas Beger, Patrick Tser Jern Kon, Sean Li, Vivek Sanker, Ming Yin, Yihao Liang, Xinlu Zhang, Ankit Agrawal, Li S. Yifei, Zechen Zhang, Mu Cai, Yasin Sonmez, Costin Cozianu, Changhao Li, Alex Slen, Shoubin Yu, Hyun Kyu Park, Gabriele Sarti, Marcin Briański, Alessandro Stolfo, Truong An Nguyen, Mike Zhang, Yotam Perlitz, Jose Hernandez-Orallo, Runjia Li, Amin Shabani, Felix Juefei-Xu, Shikhar Dhingra, Orr Zohar, My Chiffon Nguyen, Alexander Pondaven, Abdurrahim Yilmaz, Xuandong Zhao, Chuanyang Jin, Muyan Jiang, Stefan Todoran, Xinyao Han, Jules Kreuer, Brian Rabern, Anna Plassart, Martino Maggetti, Luther Yap, Robert Geirhos, Jonathon Kean, Dingsu Wang, Sina Mollaei, Chenkai Sun, Yifan Yin, Shiqi Wang, Rui Li, Yaowen Chang, Anjiang Wei, Alice Bizeul, Xiaohan Wang, Alexandre Oliveira Arrais, Kushin Mukherjee, Jorge Chamorro-Padial, Jiachen Liu, Xingyu Qu, Junyi Guan, Adam Bouyamourn, Shuyu Wu, Martyna Plomecka, Junda Chen, Mengze Tang, Jiaqi Deng, Shreyas Subramanian, Haocheng Xi, Haoxuan Chen, Weizhi Zhang, Yinuo Ren, Haoqin Tu, Sejong Kim, Yushun Chen, Sara Vera Marjanović, Junwoo Ha, Grzegorz Luczyna, Jeff J. Ma, Zewen Shen, Dawn Song, Cedegao E. Zhang, Zhun Wang, Gaël Gendron, Yunze Xiao, Leo Smucker, Erica Weng, Kwok Hao Lee, Zhe Ye, Stefano Ermon, Ignacio D. Lopez-Miguel, Theo Knights, Anthony Gitter, Namkyu Park, Boyi Wei, Hongzheng Chen, Kunal Pai, Ahmed Elkhanany, Han Lin, Philipp D. 
Siedler, Jichao Fang, Ritwik Mishra, Károly Zsolnai-Fehér, Xilin Jiang, Shadab Khan, Jun Yuan, Rishab Kumar Jain, Xi Lin, Mike Peterson, Zhe Wang, Aditya Malusare, Maosen Tang, Isha Gupta, Ivan Fosin, Timothy Kang, Barbara Dworakowska, Kazuki Matsumoto, Guangyao Zheng, Gerben Sewuster, Jorge Pretel Villanueva, Ivan Rannev, Igor Chernyavsky, Jiale Chen, Deepayan Banik, Ben Racz, Wenchao Dong, Jianxin Wang, Laila Bashmal, Duarte V. Gonçalves, Wei Hu, Kaushik Bar, Ondrej Bohdal, Atharv Singh Patlan, Shehzaad Dhuliawala, Caroline Geirhos, Julien Wist, Yuval Kansal, Bingsen Chen, Kutay Tire, Atak Talay Yücel, Brandon Christof, Veerupaksh Singla, Zijian Song, Sanxing Chen, Jiaxin Ge, Kaustubh Ponkshe, Isaac Park, Tianneng Shi, Martin Q. Ma, Joshua Mak, Sherwin Lai, Antoine Moulin, Zhuo Cheng, Zhanda Zhu, Ziyi Zhang, Vaidehi Patil, Ketan Jha, Qiutong Men, Jiaxuan Wu, Tianchi Zhang, Bruno Hebling Vieira, Alham Fikri Aji, Jae-Won Chung, Mohammed Mahfoud, Ha Thi Hoang, Marc Sperzel, Wei Hao, Kristof Meding, Sihan Xu, Vassilis Kostakos, Davide Manini, Yueying Liu, Christopher Toukmaji, Jay Paek, Eunmi Yu, Arif Engin Demircali, Zhiyi Sun, Ivan Dewerpe, Hongsen Qin, Roman Pflugfelder, James Bailey, Johnathan Morris, Ville Heilala, Sybille Rosset, Zishun Yu, Peter E. Chen, Woongyeong Yeo, Eeshaan Jain, Ryan Yang, Sreekar Chigurupati, Julia Chernyavsky, Sai Prajwal Reddy, Subhashini Venugopalan, Hunar Batra, Core Francisco Park, Hieu Tran, Guilherme Maximiano, Genghan Zhang, Yizhuo Liang, Hu Shiyu, Rongwu Xu, Rui Pan, Siddharth Suresh, Ziqi Liu, Samaksh Gulati, Songyang Zhang, Peter Turchin, Christopher W. Bartlett, Christopher R. Scotese, Phuong M. 
Cao, Aakaash Nattanmai, Gordon McKellips, Anish Cheraku, Asim Suhail, Ethan Luo, Marvin Deng, Jason Luo, Ashley Zhang, Kavin Jindel, Jay Paek, Kasper Halevy, Allen Baranov, Michael Liu, Advaith Avadhanam, David Zhang, Vincent Cheng, Brad Ma, Evan Fu, Liam Do, Joshua Lass, Hubert Yang, Surya Sunkari, Vishruth Bharath, Violet Ai, James Leung, Rishit Agrawal, Alan Zhou, Kevin Chen, Tejas Kalpathi, Ziqi Xu, Gavin Wang, Tyler Xiao, Erik Maung, Sam Lee, Ryan Yang, Roy Yue, Ben Zhao, Julia Yoon, Sunny Sun, Aryan Singh, Ethan Luo, Clark Peng, Tyler Osbey, Taozhi Wang, Daryl Echeazu, Hubert Yang, Timothy Wu, Spandan Patel, Vidhi Kulkarni, Vijaykaarti Sundarapandiyan, Ashley Zhang, Andrew Le, Zafir Nasim, Srikar Yalam, Ritesh Kasamsetty, Soham Samal, Hubert Yang, David Sun, Nihar Shah, Abhijeet Saha, Alex Zhang, Leon Nguyen, Laasya Nagumalli, Kaixin Wang, Alan Zhou, Aidan Wu, Jason Luo, Anwith Telluri, Summer Yue, Alexandr Wang, Dan Hendrycks
Title: Humanity's Last Exam
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Chinese: 针对现有基准无法有效衡量先进大语言模型能力的问题,我们推出了"人类终极考试"(HLE)这一处于人类知识前沿的多模态基准,该基准揭示了顶尖模型与人类专家水平之间存在的显著能力差距。
English: To address the limitations of existing benchmarks in measuring advanced LLM capabilities, we introduce Humanity's Last Exam (HLE), a multimodal benchmark at the frontier of human knowledge that reveals significant performance gaps between state-of-the-art models and human expertise.

Authors:Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dmitry Dodonov, Tung Nguyen, Jaeho Lee, Daron Anderson, Mikhail Doroshenko, Alun Cennyth Stokes, Mobeen Mahmood, Oleksandr Pokutnyi, Oleg Iskra, Jessica P. Wang, John-Clark Levin, Mstyslav Kazakov, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Serguei Popov, Robert Gerbicz, Geoff Galgon, Johannes Schmitt, Will Yeadon, Yongki Lee, Scott Sauers, Alvaro Sanchez, Fabian Giska, Marc Roth, Søren Riis, Saiteja Utpala, Noah Burns, Gashaw M. Goshu, Mohinder Maheshbhai Naiya, Chidozie Agu, Zachary Giboney, Antrell Cheatom, Francesco Fournier-Facio, Sarah-Jane Crowson, Lennart Finke, Zerui Cheng, Jennifer Zampese, Ryan G. Hoerr, Mark Nandor, Hyunwoo Park, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Alexis C Garretson, Edwin Taylor, Damien Sileo, Qiuyu Ren, Usman Qazi, Lianghui Li, Jungbae Nam, John B. Wydallis, Pavel Arkhipov, Jack Wei Lun Shi, Aras Bacho, Chris G. Willcocks, Hangrui Cao, Sumeet Motwani, Emily de Oliveira Santos, Johannes Veith, Edward Vendrow, Doru Cojoc, Kengo Zenitani, Joshua Robinson, Longke Tang, Yuqi Li, Joshua Vendrow, Natanael Wildner Fraga, Vladyslav Kuchkin, Andrey Pupasov Maksimov, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Aleksandar Mikov, Andrew Gritsevskiy, Julien Guillod, Gözdenur Demir, Dakotah Martinez, Ben Pageler, Kevin Zhou, Saeed Soori, Ori Press, Henry Tang, Paolo Rissone, Sean R. 
Green, Lina Brüssel, Moon Twayana, Aymeric Dieuleveut, Joseph Marvin Imperial, Ameya Prabhu, Jinzhou Yang, Nick Crispino, Arun Rao, Dimitri Zvonkine, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Nate Stambaugh, Subrata Mishra, Tad Hogg, Carlo Bosio, Brian P Coppola, Julian Salazar, Jaehyeok Jin, Rafael Sayous, Stefan Ivanov, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Andres Algaba, Kelsey Van den Houte, Lynn Van Der Sypt, Brecht Verbeken, David Noever, Alexei Kopylov, Benjamin Myklebust, Bikun Li, Lisa Schut, Evgenii Zheltonozhskii, Qiaochu Yuan, Derek Lim, Richard Stanley, Tong Yang, John Maar, Julian Wykowski, Martí Oller, Anmol Sahu, Cesare Giulio Ardito, Yuzheng Hu, Ariel Ghislain Kemogne Kamdoum, Alvin Jin, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Gongbo Sun, Daniil S. Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Joseph M Cavanagh, Daofeng Li, Jiawei Shen, Donato Crisostomi, Wenjin Zhang, Ali Dehghan, Sergey Ivanov, David Perrella, Nurdin Kaparov, Allen Zang, Ilia Sucholutsky, Arina Kharlamova, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Shankar Sivarajan, Dan Bar Hava, Aleksey Kuchkin, David Holmes, Alexandra Rodriguez-Romero, Frank Sommerhage, Anji Zhang, Richard Moat, Keith Schneider, Zakayo Kazibwe, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Sara Fish, Veit Elser, Tobias Kreiman, Victor Efren Guadarrama Vilchis, Immo Klose, Ujjwala Anantheswaran, Adam Zweiger, Kaivalya Rawal, Jeffery Li, Jeremy Nguyen, Nicolas Daans, Haline Heidinger, Maksim Radionov, Václav Rozhoň, Vincent Ginis, Christian Stump, Niv Cohen, Rafał Poświata, Josef Tkadlec, Alan Goldfarb, Chenguang Wang, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Ryan Stendall, Jamie Tucker-Foltz, Jack Stade, T. 
Ryan Rogers, Tom Goertzen, Declan Grabb, Abhishek Shukla, Alan Givré, John Arnold Ambay, Archan Sen, Muhammad Fayez Aziz, Mark H Inlow, Hao He, Ling Zhang, Younesse Kaddar, Ivar Ängquist, Yanxu Chen, Harrison K Wang, Kalyan Ramakrishnan, Elliott Thornley, Antonio Terpin, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Martin Stehberger, Peter Bradshaw, JP Heimonen, Kaustubh Sridhar, Ido Akov, Jennifer Sandlin, Yury Makarychev, Joanna Tam, Hieu Hoang, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, David Aldous, Jesyin Lai, Shannon Coleman, Jiangnan Xu, Sangwon Lee, Ilias Magoulas, Sandy Zhao, Ning Tang, Michael K. Cohen, Orr Paradise, Jan Hendrik Kirchner, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Michael Wang, Yuzhou Nie, Anna Sztyber-Betley, Paolo Faraboschi, Robin Riblet, Jonathan Crozier, Shiv Halasyamani, Shreyas Verma, Prashant Joshi, Eli Meril, Ziqiao Ma, Jérémy Andréoletti, Raghav Singhal, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Alexander Ivanov, Seri Khoury, Nils Gustafsson, Marco Piccardo, Hamid Mostaghimi, Qijia Chen, Virendra Singh, Tran Quoc Khánh, Paul Rosu, Hannah Szlyk, Zachary Brown, Himanshu Narayan, Aline Menezes, Jonathan Roberts, William Alley, Kunyang Sun, Arkil Patel, Max Lamparth, Anka Reuel, Linwei Xin, Hanmeng Xu, Jacob Loader, Freddie Martin, Zixuan Wang, Andrea Achilleos, Thomas Preu, Tomek Korbak, Ida Bosio, Fereshteh Kazemi, Ziye Chen, Biró Bálint, Eve J. Y. Lo, Jiaqi Wang, Maria Inês S. Nunes, Jeremiah Milbauer, M Saiful Bari, Zihao Wang, Behzad Ansarinejad, Yewen Sun, Stephane Durand, Hossam Elgnainy, Guillaume Douville, Daniel Tordera, George Balabanian, Hew Wolff, Lynna Kvistad, Hsiaoyun Milliron, Ahmad Sakor, Murat Eron, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Sherwin Abdoli, Tim Santens, Shaul Barkan, Allison Tee, Robin Zhang, Alessandro Tomasiello, G. 
Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Jiayi Pan, Emma Rodman, Jacob Drori, Carl J Fossum, Niklas Muennighoff, Milind Jagota, Ronak Pradeep, Honglu Fan, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Moritz Firsching, Carter Harris, Stefan Ciobâcă, Jason Gross, Rohan Pandey, Ilya Gusev, Adam Jones, Shashank Agnihotri, Pavel Zhelnov, Mohammadreza Mofayezi, Alexander Piperski, David K. Zhang, Kostiantyn Dobarskyi, Roman Leventov, Ignat Soroko, Joshua Duersch, Vage Taamazyan, Andrew Ho, Wenjie Ma, William Held, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Katarzyna Olszewska, Claudio Di Fratta, Edson Oliveira, Joseph W. Jackson, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Bita Golshani, David Stap, Egor Kretov, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Nick Winter, Miguel Orbegozo Rodriguez, Robert Lauff, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Fortuna Samuele, Fredrik Ekström, Angela Hammon, Oam Patel, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene Peñaflor, Haile Kassahun, Alena Friedrich, Rayner Hernandez Perez, Daniel Pyda, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Kenchi Okutsu, Mike Battaglia, Mohammad Maghsoudimehrabani, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Handoko, Anton Peristyy, Stephen Malina, Mustafa Mehkary, Rami Aly, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Mukhwinder Singh, Hassan Shapourian, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Harsh Kumar, Chiara Ceconello, Chao Zhuang, Haon Park, Micah Carroll, Andrew R. Tawfeek, Stefan Steinerberger, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Jainam Shah, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. 
Pham, Kang Yong Loh, Joshua Robinson, Abram Jackson, Paolo Giordano, Philipp Petersen, Adrian Cosma, Jesus Colino, Colin White, Jacob Votava, Vladimir Vinnikov, Ethan Delaney, Petr Spelda, Vit Stritecky, Syed M. Shahid, Jean-Christophe Mourrat, Lavr Vetoshkin, Koen Sponselee, Renas Bacho, Zheng-Xin Yong, Florencia de la Rosa, Nathan Cho, Xiuyu Li, Guillaume Malod, Orion Weller, Guglielmo Albani, Leon Lang, Julien Laurendeau, Dmitry Kazakov, Fatimah Adesanya, Julien Portier, Lawrence Hollom, Victor Souza, Yuchen Anna Zhou, Julien Degorre, Yiğit Yalın, Gbenga Daniel Obikoya, Rai, Filippo Bigi, M. C. Boscá, Oleg Shumar, Kaniuar Bacho, Gabriel Recchia, Mara Popescu, Nikita Shulga, Ngefor Mildred Tanwie, Thomas C. H. Lux, Ben Rank, Colin Ni, Matthew Brooks, Alesia Yakimchyk, Huanxu, Liu, Stefano Cavalleri, Olle Häggström, Emil Verkama, Joshua Newbould, Hans Gundlach, Leonor Brito-Santana, Brian Amaro, Vivek Vajipey, Rynaa Grover, Ting Wang, Yosi Kratish, Wen-Ding Li, Sivakanth Gopi, Andrea Caciolai, Christian Schroeder de Witt, Pablo Hernández-Cámara, Emanuele Rodolà, Jules Robins, Dominic Williamson, Vincent Cheng, Brad Raynor, Hao Qi, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Christoph Demian, Peyman Kassani, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Alon Ragoler, Justin Tan, Blake Sims, Rebeka Plecnik, Aaron Kirtland, Omer Faruk Bodur, D. P. Shinde, Yan Carlos Leyva Labrador, Zahra Adoul, Mohamed Zekry, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Earth Anderson, Rodrigo De Oliveira Pena, Elizabeth Kelley, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, Ross Finocchio, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Isaac C. 
McAlister, Alejandro José Moyano, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Yana Malysheva, Daphiny Pottmaier, Omid Taheri, Stanley Stepanic, Samuel Perry, Luke Askew, Raúl Adrián Huerta Rodríguez, Ali M. R. Minissi, Ricardo Lorena, Krishnamurthy Iyer, Arshad Anil Fasiludeen, Ronald Clark, Josh Ducey, Matheus Piza, Maja Somrak, Eric Vergo, Juehang Qin, Benjámin Borbás, Eric Chu, Jack Lindsey, Antoine Jallon, I. M. J. McInnis, Evan Chen, Avi Semler, Luk Gloor, Tej Shah, Marc Carauleanu, Pascal Lauer, Tran Đuc Huy, Hossein Shahrtash, Emilien Duc, Lukas Lewark, Assaf Brown, Samuel Albanie, Brian Weber, Warren S. Vaz, Pierre Clavier, Yiyang Fan, Gabriel Poesia Reis e Silva, Long, Lian, Marcus Abramovitch, Xi Jiang, Sandra Mendoza, Murat Islam, Juan Gonzalez, Vasilios Mavroudis, Justin Xu, Pawan Kumar, Laxman Prasad Goswami, Daniel Bugas, Nasser Heydari, Ferenc Jeanplong, Thorben Jansen, Antonella Pinto, Archimedes Apronti, Abdallah Galal, Ng Ze-An, Ankit Singh, Tong Jiang, Joan of Arc Xavier, Kanu Priya Agarwal, Mohammed Berkani, Gang Zhang, Zhehang Du, Benedito Alves de Oliveira Junior, Dmitry Malishev, Nicolas Remy, Taylor D. 
Hartman, Tim Tarver, Stephen Mensah, Gautier Abou Loume, Wiktor Morak, Farzad Habibi, Sarah Hoback, Will Cai, Javier Gimenez, Roselynn Grace Montecillo, Jakub Łucki, Russell Campbell, Asankhaya Sharma, Khalida Meer, Shreen Gul, Daniel Espinosa Gonzalez, Xavier Alapont, Alex Hoover, Gunjan Chhablani, Freddie Vargus, Arunim Agarwal, Yibo Jiang, Deepakkumar Patil, David Outevsky, Kevin Joseph Scaria, Rajat Maheshwari, Abdelkader Dendane, Priti Shukla, Ashley Cartwright, Sergei Bogdanov, Niels Mündler, Sören Möller, Luca Arnaboldi, Kunvar Thaman, Muhammad Rehan Siddiqi, Prajvi Saxena, Himanshu Gupta, Tony Fruhauff, Glen Sherman, Mátyás Vincze, Siranut Usawasutsakorn, Dylan Ler, Anil Radhakrishnan, Innocent Enyekwe, Sk Md Salauddin, Jiang Muzhen, Aleksandr Maksapetyan, Vivien Rossbach, Chris Harjadi, Mohsen Bahaloohoreh, Claire Sparrow, Jasdeep Sidhu, Sam Ali, Song Bian, John Lai, Eric Singer, Justine Leon Uro, Greg Bateman, Mohamed Sayed, Ahmed Menshawy, Darling Duclosel, Dario Bezzi, Yashaswini Jain, Ashley Aaron, Murat Tiryakioglu, Sheeshram Siddh, Keith Krenek, Imad Ali Shah, Jun Jin, Scott Creighton, Denis Peskoff, Zienab EL-Wasif, Ragavendran P, Michael Richmond, Joseph McGowan, Tejal Patwardhan, Hao-Yu Sun, Ting Sun, Nikola Zubić, Samuele Sala, Stephen Ebert, Jean Kaddour, Manuel Schottdorf, Dianzhuo Wang, Gerol Petruzella, Alex Meiburg, Tilen Medved, Ali ElSheikh, S Ashwin Hebbar, Lorenzo Vaquero, Xianjun Yang, Jason Poulos, Vilém Zouhar, Sergey Bogdanik, Mingfang Zhang, Jorge Sanz-Ros, David Anugraha, Yinwei Dai, Anh N. Nhu, Xue Wang, Ali Anil Demircali, Zhibai Jia, Yuyin Zhou, Juncheng Wu, Mike He, Nitin Chandok, Aarush Sinha, Gaoxiang Luo, Long Le, Mickaël Noyé, Michał Perełkiewicz, Ioannis Pantidis, Tianbo Qi, Soham Sachin Purohit, Letitia Parcalabescu, Thai-Hoa Nguyen, Genta Indra Winata, Edoardo M. Ponti, Hanchen Li, Kaustubh Dhole, Jongee Park, Dario Abbondanza, Yuanli Wang, Anupam Nayak, Diogo M. Caetano, Antonio A. W. L. 
Wong, Maria del Rio-Chanona, Dániel Kondor, Pieter Francois, Ed Chalstrey, Jakob Zsambok, Dan Hoyer, Jenny Reddish, Jakob Hauser, Francisco-Javier Rodrigo-Ginés, Suchandra Datta, Maxwell Shepherd, Thom Kamphuis, Qizheng Zhang, Hyunjun Kim, Ruiji Sun, Jianzhu Yao, Franck Dernoncourt, Satyapriya Krishna, Sina Rismanchian, Bonan Pu, Francesco Pinto, Yingheng Wang, Kumar Shridhar, Kalon J. Overholt, Glib Briia, Hieu Nguyen, David, Soler Bartomeu, Tony CY Pang, Adam Wecker, Yifan Xiong, Fanfei Li, Lukas S. Huber, Joshua Jaeger, Romano De Maddalena, Xing Han Lù, Yuhui Zhang, Claas Beger, Patrick Tser Jern Kon, Sean Li, Vivek Sanker, Ming Yin, Yihao Liang, Xinlu Zhang, Ankit Agrawal, Li S. Yifei, Zechen Zhang, Mu Cai, Yasin Sonmez, Costin Cozianu, Changhao Li, Alex Slen, Shoubin Yu, Hyun Kyu Park, Gabriele Sarti, Marcin Briański, Alessandro Stolfo, Truong An Nguyen, Mike Zhang, Yotam Perlitz, Jose Hernandez-Orallo, Runjia Li, Amin Shabani, Felix Juefei-Xu, Shikhar Dhingra, Orr Zohar, My Chiffon Nguyen, Alexander Pondaven, Abdurrahim Yilmaz, Xuandong Zhao, Chuanyang Jin, Muyan Jiang, Stefan Todoran, Xinyao Han, Jules Kreuer, Brian Rabern, Anna Plassart, Martino Maggetti, Luther Yap, Robert Geirhos, Jonathon Kean, Dingsu Wang, Sina Mollaei, Chenkai Sun, Yifan Yin, Shiqi Wang, Rui Li, Yaowen Chang, Anjiang Wei, Alice Bizeul, Xiaohan Wang, Alexandre Oliveira Arrais, Kushin Mukherjee, Jorge Chamorro-Padial, Jiachen Liu, Xingyu Qu, Junyi Guan, Adam Bouyamourn, Shuyu Wu, Martyna Plomecka, Junda Chen, Mengze Tang, Jiaqi Deng, Shreyas Subramanian, Haocheng Xi, Haoxuan Chen, Weizhi Zhang, Yinuo Ren, Haoqin Tu, Sejong Kim, Yushun Chen, Sara Vera Marjanović, Junwoo Ha, Grzegorz Luczyna, Jeff J. Ma, Zewen Shen, Dawn Song, Cedegao E. Zhang, Zhun Wang, Gaël Gendron, Yunze Xiao, Leo Smucker, Erica Weng, Kwok Hao Lee, Zhe Ye, Stefano Ermon, Ignacio D. Lopez-Miguel, Theo Knights, Anthony Gitter, Namkyu Park, Boyi Wei, Hongzheng Chen, Kunal Pai, Ahmed Elkhanany, Han Lin, Philipp D. 
Siedler, Jichao Fang, Ritwik Mishra, Károly Zsolnai-Fehér, Xilin Jiang, Shadab Khan, Jun Yuan, Rishab Kumar Jain, Xi Lin, Mike Peterson, Zhe Wang, Aditya Malusare, Maosen Tang, Isha Gupta, Ivan Fosin, Timothy Kang, Barbara Dworakowska, Kazuki Matsumoto, Guangyao Zheng, Gerben Sewuster, Jorge Pretel Villanueva, Ivan Rannev, Igor Chernyavsky, Jiale Chen, Deepayan Banik, Ben Racz, Wenchao Dong, Jianxin Wang, Laila Bashmal, Duarte V. Gonçalves, Wei Hu, Kaushik Bar, Ondrej Bohdal, Atharv Singh Patlan, Shehzaad Dhuliawala, Caroline Geirhos, Julien Wist, Yuval Kansal, Bingsen Chen, Kutay Tire, Atak Talay Yücel, Brandon Christof, Veerupaksh Singla, Zijian Song, Sanxing Chen, Jiaxin Ge, Kaustubh Ponkshe, Isaac Park, Tianneng Shi, Martin Q. Ma, Joshua Mak, Sherwin Lai, Antoine Moulin, Zhuo Cheng, Zhanda Zhu, Ziyi Zhang, Vaidehi Patil, Ketan Jha, Qiutong Men, Jiaxuan Wu, Tianchi Zhang, Bruno Hebling Vieira, Alham Fikri Aji, Jae-Won Chung, Mohammed Mahfoud, Ha Thi Hoang, Marc Sperzel, Wei Hao, Kristof Meding, Sihan Xu, Vassilis Kostakos, Davide Manini, Yueying Liu, Christopher Toukmaji, Jay Paek, Eunmi Yu, Arif Engin Demircali, Zhiyi Sun, Ivan Dewerpe, Hongsen Qin, Roman Pflugfelder, James Bailey, Johnathan Morris, Ville Heilala, Sybille Rosset, Zishun Yu, Peter E. Chen, Woongyeong Yeo, Eeshaan Jain, Ryan Yang, Sreekar Chigurupati, Julia Chernyavsky, Sai Prajwal Reddy, Subhashini Venugopalan, Hunar Batra, Core Francisco Park, Hieu Tran, Guilherme Maximiano, Genghan Zhang, Yizhuo Liang, Hu Shiyu, Rongwu Xu, Rui Pan, Siddharth Suresh, Ziqi Liu, Samaksh Gulati, Songyang Zhang, Peter Turchin, Christopher W. Bartlett, Christopher R. Scotese, Phuong M. 
Cao, Ben Wu, Jacek Karwowski, Davide Scaramuzza, Aakaash Nattanmai, Gordon McKellips, Anish Cheraku, Asim Suhail, Ethan Luo, Marvin Deng, Jason Luo, Ashley Zhang, Kavin Jindel, Jay Paek, Kasper Halevy, Allen Baranov, Michael Liu, Advaith Avadhanam, David Zhang, Vincent Cheng, Brad Ma, Evan Fu, Liam Do, Joshua Lass, Hubert Yang, Surya Sunkari, Vishruth Bharath, Violet Ai, James Leung, Rishit Agrawal, Alan Zhou, Kevin Chen, Tejas Kalpathi, Ziqi Xu, Gavin Wang, Tyler Xiao, Erik Maung, Sam Lee, Ryan Yang, Roy Yue, Ben Zhao, Julia Yoon, Sunny Sun, Aryan Singh, Ethan Luo, Clark Peng, Tyler Osbey, Taozhi Wang, Daryl Echeazu, Hubert Yang, Timothy Wu, Spandan Patel, Vidhi Kulkarni, Vijaykaarti Sundarapandiyan, Ashley Zhang, Andrew Le, Zafir Nasim, Srikar Yalam, Ritesh Kasamsetty, Soham Samal, Hubert Yang, David Sun, Nihar Shah, Abhijeet Saha, Alex Zhang, Leon Nguyen, Laasya Nagumalli, Kaixin Wang, Alan Zhou, Aidan Wu, Jason Luo, Anwith Telluri, Summer Yue, Alexandr Wang, Dan Hendrycks
Title: Humanity's Last Exam
Abstract:
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90\% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Chinese: 针对现有基准无法有效衡量先进大语言模型能力的问题,我们推出了"人类终极考试"(HLE)这一处于人类知识前沿的多模态基准,该基准揭示了顶尖模型与人类专家水平之间存在的显著能力差距。
English: To address the limitations of existing benchmarks in measuring advanced LLM capabilities, we introduce Humanity's Last Exam (HLE), a multimodal benchmark at the frontier of human knowledge that reveals significant performance gaps between state-of-the-art models and human expertise.

Authors:Jialun Cao, Yuk-Kit Chan, Zixuan Ling, Wenxuan Wang, Shuqing Li, Mingwei Liu, Ruixi Qiao, Yuting Han, Chaozheng Wang, Boxi Yu, Pinjia He, Shuai Wang, Zibin Zheng, Michael R. Lyu, Shing-Chi Cheung
Title: How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
Abstract:
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines by which such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose How2Bench, which comprises a 55-criteria checklist as a set of guidelines to govern the development of code-related benchmarks comprehensively. Using How2Bench, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks did not take measures for data quality assurance; over 10% were not open-sourced at all or were only partially open-sourced. Many highly cited benchmarks have loopholes, including duplicated samples, incorrect reference code/tests/prompts, and unremoved sensitive/confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
中文摘要:本研究提出了How2Bench这一包含55项标准的检查清单,旨在系统指导代码相关基准的开发,通过对274个基准的分析和人类研究发现现有基准普遍存在数据质量缺陷,并揭示了开发者对数据质量重要性的认知不足。
English Summary: The study introduces How2Bench, a 55-criteria checklist to guide the development of reliable code-related benchmarks, revealing widespread issues in existing benchmarks through an analysis of 274 cases and a human study highlighting gaps in data quality awareness.

Authors:Xintai Chen, Biqian Feng, Yongpeng Wu, Wenjun Zhang
Title: Energy Efficiency Maximization for Movable Antenna-Enhanced System Based on Statistical CSI
Abstract:
This paper investigates an innovative movable antenna (MA)-enhanced multiple-input multiple-output (MIMO) system designed to enhance communication performance. We aim to maximize the energy efficiency (EE) under statistical channel state information (S-CSI) through a joint optimization of the transmit covariance matrix and the antenna position vectors (APVs). To solve the stochastic problem, we consider the large-antenna-number scenario and resort to deterministic equivalent (DE) technology to reformulate the system EE w.r.t. the transmit variables, i.e., the transmit covariance matrix and APV, and the receive variables, i.e., the receive APV, respectively. Then, we propose an alternating optimization (AO) algorithm that updates the transmit variables and the receive variables in turn to maximize the system EE. Our numerical results reveal that the proposed MA-enhanced system can significantly improve EE compared to several benchmark schemes and that the optimal performance can be achieved with a finite size of movement regions for MAs.
中文: 本文提出一种可移动天线增强的MIMO系统,通过联合优化发射协方差矩阵与天线位置向量,在有限移动区域内实现能效的最大化。
English: This paper introduces a movable antenna-enhanced MIMO system that optimizes energy efficiency through joint transmit covariance and antenna positioning, achieving superior performance with finite movement regions.

Authors:Thomas Benz, Alessandro Ottaviano, Chaoqun Liang, Robert Balas, Angelo Garofalo, Francesco Restuccia, Alessandro Biondi, Davide Rossi, Luca Benini
Title: AXI-REALM: Safe, Modular and Lightweight Traffic Monitoring and Regulation for Heterogeneous Mixed-Criticality Systems
Abstract:
The automotive industry is transitioning from federated, homogeneous, interconnected devices to integrated, heterogeneous, mixed-criticality systems (MCS). This makes timing predictability hard to achieve because of access contention on shared resources, which can be mitigated using hardware-based spatial and temporal isolation techniques. Focusing on the interconnect as the point of access for shared resources, we propose AXI-REALM, a lightweight, modular, technology-independent, and open-source real-time extension to AXI4 interconnects. AXI-REALM uses a budget-based mechanism enforced on periodic time windows and transfer fragmentation to provide fair arbitration, coupled with execution predictability on real-time workloads. AXI-REALM features a comprehensive bandwidth and latency monitor at both the ingress and egress of the interconnect system. Latency information is also used to detect and reset malfunctioning subordinates, preventing missed deadlines. We provide a detailed cost assessment in a 12 nm node and an end-to-end case study implementing AXI-REALM into an open-source MCS, incurring an area overhead of less than 2%. When running a mixed-criticality workload, with a time-critical application sharing the interconnect with non-critical applications, we demonstrate that the critical application can achieve up to 68.2% of the isolated performance by enforcing fairness on the interconnect traffic through burst fragmentation, thus reducing the subordinate access latency by up to 24 times. Near-ideal performance (above 95% of the isolated performance) can be achieved by distributing the available bandwidth in favor of the critical application.
Chinese: 汽车行业正转向混合关键性系统,面临共享资源竞争带来的时序挑战,AXI-REALM作为AXI4互连的轻量级开源扩展,通过公平仲裁和执行可预测性解决了这些问题,且面积开销极小。
English: The automotive industry is shifting to mixed-criticality systems, facing timing challenges from shared resource contention, which AXI-REALM addresses as a lightweight, open-source extension to AXI4 interconnects, providing fair arbitration and execution predictability with minimal area overhead.
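
As a rough illustration of the two ideas named in the abstract above (a budget enforced on periodic time windows, and burst fragmentation), the following is a minimal software sketch. It is not the AXI-REALM hardware design: the class name `BudgetRegulator`, the function `fragment`, and all parameters are hypothetical, and budgets are counted in data beats purely as an assumption.

```python
def fragment(burst_beats, max_beats):
    """Split one burst into fragments of at most max_beats beats each."""
    return [min(max_beats, burst_beats - start)
            for start in range(0, burst_beats, max_beats)]


class BudgetRegulator:
    """Hypothetical per-manager regulator: a beat budget refilled every period."""

    def __init__(self, budget, period):
        self.budget = budget        # beats allowed per window (assumed unit)
        self.period = period        # window length in cycles
        self.remaining = budget
        self.window_start = 0

    def try_issue(self, now, beats):
        # Refill the budget when a new periodic window has begun.
        if now - self.window_start >= self.period:
            self.window_start = now - (now - self.window_start) % self.period
            self.remaining = self.budget
        # Issue the fragment only if it fits in the remaining budget.
        if beats <= self.remaining:
            self.remaining -= beats
            return True
        return False
```

Short fragments give the arbiter more frequent opportunities to interleave a critical manager's requests, while the budget caps how much bandwidth any one manager can consume per window.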

Authors:Deepti Hegde, Rajeev Yasarla, Hong Cai, Shizhong Han, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Risheek Garrepalli, Vishal M. Patel, Fatih Porikli
Title: Distilling Multi-modal Large Language Models for Autonomous Driving
Abstract:
Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in longtail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
中文: DiMA是一种高效的自动驾驶系统,通过专门设计的代理任务将多模态大语言模型的知识提炼到基于视觉的规划器中,在无需推理时使用大语言模型的情况下,显著提升了轨迹精度并降低了碰撞率。
English: DiMA is an efficient autonomous driving system that distills knowledge from a multi-modal LLM into a vision-based planner through specialized surrogate tasks, eliminating the need for LLMs during inference while significantly improving trajectory accuracy and reducing collisions.

Authors:Yuxuan Shi, Shuo Shao, Yongpeng Wu, Wenjun Zhang, Merouane Debbah
Title: RWZC: A Model-Driven Approach for Learning-based Robust Wyner-Ziv Coding
Abstract:
In this paper, a novel learning-based Wyner-Ziv coding framework is considered under a distributed image transmission scenario, where the correlated source is only available at the receiver. Unlike other learnable frameworks, our approach demonstrates robustness to non-stationary source correlation, where the overlapping information between image pairs varies. Specifically, we first model the affine relationship between correlated images and leverage this model for learnable mask generation and rate-adaptive joint source-channel coding. Moreover, we also provide a warping-prediction network to remove the distortion from channel interference and affine transform. Intuitively, the observed performance improvement is largely due to focusing on the simple geometric relationship, rather than the complex joint distribution between the sources. Numerical results show that our framework achieves a 1.5 dB gain in PSNR and a 0.2 improvement in MS-SSIM, along with a significant superiority in perceptual metrics, compared to state-of-the-art methods when applied to real-world samples with non-stationary correlations.
Chinese: 本文提出了一种基于学习的Wyner-Ziv编码框架,通过建立图像间的仿射关系模型并采用自适应编码技术,对非平稳源相关性表现出强鲁棒性,相比现有方法实现了显著的性能提升。
English: This paper introduces a learning-based Wyner-Ziv coding framework that demonstrates robustness to non-stationary source correlations by modeling affine relationships between images and employing adaptive coding techniques, achieving significant performance gains over existing methods.

Authors:Zeineb Haouari, Jonas Weidner, Yeray Martin-Ruisanchez, Ivan Ezhov, Aswathi Varma, Daniel Rueckert, Bjoern Menze, Benedikt Wiestler
Title: Efficient Deep Learning-based Forward Solvers for Brain Tumor Growth Models
Abstract:
Glioblastoma, a highly aggressive brain tumor, poses major challenges due to its poor prognosis and high morbidity rates. Partial differential equation-based models offer promising potential to enhance therapeutic outcomes by simulating patient-specific tumor behavior for improved radiotherapy planning. However, model calibration remains a bottleneck due to the high computational demands of optimization methods like Monte Carlo sampling and evolutionary algorithms. To address this, we recently introduced an approach leveraging a neural forward solver with gradient-based optimization to significantly reduce calibration time. This approach requires a highly accurate and fully differentiable forward model. We investigate multiple architectures, including (i) an enhanced TumorSurrogate, (ii) a modified nnU-Net, and (iii) a 3D Vision Transformer (ViT). The nnU-Net achieved the best overall results, excelling in both tumor outline matching and voxel-level prediction of tumor cell concentration. It yielded the lowest MSE in tumor cell concentration compared to ground truth numerical simulation and the highest Dice score across all tumor cell concentration thresholds. Our study demonstrates significant enhancement in forward solver performance and outlines important future research directions.
中文摘要:本研究通过评估多种神经网络架构在胶质母细胞瘤模拟中的表现,解决了模型校准的计算瓶颈问题,其中改进版nnU-Net在肿瘤细胞浓度预测和轮廓匹配方面取得了最佳性能。
English Summary: This study addresses the computational bottleneck in calibrating glioblastoma models by evaluating neural network architectures for tumor simulation, with nnU-Net achieving superior accuracy in predicting tumor cell concentration and outline matching.
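
The two evaluation quantities named in the abstract above, MSE on tumor cell concentration and the Dice score at a concentration threshold, can be sketched with their standard definitions. This is only an illustration on flat lists of voxel values; the function names are hypothetical and the paper's actual evaluation code and thresholds are not given in the abstract.

```python
def mse(pred, truth):
    """Mean squared error between predicted and reference concentrations."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def dice_at_threshold(pred, truth, threshold):
    """Dice overlap of the regions where concentration >= threshold."""
    pred_mask = [v >= threshold for v in pred]
    truth_mask = [v >= threshold for v in truth]
    intersection = sum(p and t for p, t in zip(pred_mask, truth_mask))
    total = sum(pred_mask) + sum(truth_mask)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

Sweeping `threshold` over several values gives the "across all thresholds" comparison the abstract refers to.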

Authors:Paul Scheffler, Thomas Benz, Viviane Potocnik, Tim Fischer, Luca Colagrande, Nils Wistoff, Yichao Zhang, Luca Bertaccini, Gianmarco Ottavi, Manuel Eggimann, Matheus Cavalcante, Gianna Paulin, Frank K. Gürkaynak, Davide Rossi, Luca Benini
Title: Occamy: A 432-Core Dual-Chiplet Dual-HBM2E 768-DP-GFLOP/s RISC-V System for 8-to-64-bit Dense and Sparse Computing in 12nm FinFET
Abstract:
ML and HPC applications increasingly combine dense and sparse memory access computations to maximize storage efficiency. However, existing CPUs and GPUs struggle to flexibly handle these heterogeneous workloads with consistently high compute efficiency. We present Occamy, a 432-Core, 768-DP-GFLOP/s, dual-HBM2E, dual-chiplet RISC-V system with a latency-tolerant hierarchical interconnect and in-core streaming units (SUs) designed to accelerate dense and sparse FP8-to-FP64 ML and HPC workloads. We implement Occamy's compute chiplets in 12 nm FinFET, and its passive interposer, Hedwig, in a 65 nm node. On dense linear algebra (LA), Occamy achieves a competitive FPU utilization of 89%. On stencil codes, Occamy reaches an FPU utilization of 83% and a technology-node-normalized compute density of 11.1 DP-GFLOP/s/mm², leading state-of-the-art (SoA) processors by 1.7x and 1.2x, respectively. On sparse-dense LA, it achieves 42% FPU utilization and a normalized compute density of 5.95 DP-GFLOP/s/mm², surpassing the SoA by 5.2x and 11x, respectively. On sparse-sparse LA, Occamy reaches a throughput of up to 187 GCOMP/s at 17.4 GCOMP/s/W and a compute density of 3.63 GCOMP/s/mm². Finally, we reach up to 75% and 54% FPU utilization on dense (LLM) and graph-sparse (GCN) ML inference workloads, respectively. Occamy's RTL is freely available under a permissive open-source license.
中文: Occamy是一款双芯粒RISC-V系统,采用分层互连和核心内流处理单元设计,能高效加速密集与稀疏的机器学习和高性能计算任务,在多种应用中实现高浮点单元利用率,并在计算密度上显著超越现有先进处理器。
English: Occamy is a dual-chiplet RISC-V system featuring a hierarchical interconnect and in-core streaming units, designed to efficiently accelerate both dense and sparse ML and HPC workloads, achieving high FPU utilization and surpassing state-of-the-art processors in compute density across various applications.

Authors:Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
Title: Self-Evolving Critique Abilities in Large Language Models
Abstract:
Despite their remarkable performance, Large Language Models (LLMs) face a critical challenge: providing feedback for tasks where human evaluation is difficult or where LLMs potentially outperform humans. In such scenarios, leveraging the critique ability of LLMs themselves - identifying and correcting flaws - shows considerable promise. This paper explores enhancing critique abilities of LLMs, noting that current approaches rely on human annotations or more powerful models, leaving the challenge of improving critique abilities without external supervision unresolved. We introduce SCRIT (Self-evolving CRITic), a framework that trains LLMs with self-generated data to evolve their critique abilities. To address the low quality of naively generated data, we propose a contrastive-critic approach that uses reference solutions during data synthesis to enhance the model's understanding of key concepts, and incorporates a self-validation scheme to ensure data quality. The final trained model operates without any reference solutions at inference time. Implemented with Qwen2.5-72B-Instruct, a leading LLM, SCRIT demonstrates consistent improvements across a wide range of benchmarks spanning both mathematical and scientific reasoning: achieving a 10.0\% relative gain in critique-correction accuracy and a 19.0\% relative improvement in error identification F1-score. Our analysis reveals that SCRIT's performance scales positively with data and model size and enables continuous improvement through multi-round iterations.
中文: 本文提出SCRIT自演进框架,通过采用对比式评判方法和自验证机制训练大语言模型使用自生成数据,在无需外部监督的情况下显著提升了模型在数学与科学推理任务中的评判能力。
English: This paper introduces SCRIT, a self-evolving framework that enhances Large Language Models' critique abilities by training them with self-generated data using contrastive-critic methods and self-validation, achieving significant improvements in mathematical and scientific reasoning benchmarks without external supervision.

Authors:Tianyu Cui, Jinbin Bai, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ye Shi
Title: Evaluating Image Caption via Cycle-consistent Text-to-Image Generation
Abstract:
Evaluating image captions typically relies on reference captions, which are costly to obtain and exhibit significant diversity and subjectivity. While reference-free evaluation metrics have been proposed, most focus on cross-modal evaluation between captions and images. Recent research has revealed that the modality gap generally exists in the representation of contrastive learning-based multi-modal systems, undermining the reliability of cross-modality metrics like CLIPScore. In this paper, we propose CAMScore, a cyclic reference-free automatic evaluation metric for image captioning models. To circumvent the aforementioned modality gap, CAMScore utilizes a text-to-image model to generate images from captions and subsequently evaluates these generated images against the original images. Furthermore, to provide fine-grained information for a more comprehensive evaluation, we design a three-level evaluation framework for CAMScore that encompasses pixel-level, semantic-level, and objective-level perspectives. Extensive experiment results across multiple benchmark datasets show that CAMScore achieves a superior correlation with human judgments compared to existing reference-based and reference-free metrics, demonstrating the effectiveness of the framework.
中文: CAMScore提出了一种无需参考图像描述的自动评估方法,通过文本生成图像模型将描述转化为图像并与原图比较,采用像素、语义和客观三个层面的评估框架,在多个基准测试中显示出比现有方法更高的人类判断相关性。
English: CAMScore introduces a novel reference-free evaluation metric for image captions by using a text-to-image model to generate images from captions and comparing them with the original images, achieving higher correlation with human judgments than existing metrics through a three-level evaluation framework.

Authors:Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming Li
Title: MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Abstract:
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, which presents two challenges: the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a \((1 - 1/e)\)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
中文摘要:本文提出MDP3方法,通过结合行列式点过程和马尔可夫决策过程,实现了视频大语言模型中兼顾查询相关性、列表多样性和时序性的无训练框架选择方案。
English Summary: This paper introduces MDP3, a training-free and model-agnostic method for frame selection in Video-LLMs that ensures query relevance, list-wise diversity, and sequentiality by combining determinantal point processes with Markov decision processes.
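
To make the DPP step in the abstract above more concrete, here is a minimal sketch of generic greedy DPP selection that balances relevance and diversity. It is not the MDP3 algorithm: it uses a plain (not query-conditional) Gaussian kernel, omits the segment-wise MDP and dynamic programming, and all function names, the `relevance` scores, and `sigma` are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Plain Gaussian similarity between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    result = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result


def greedy_dpp(features, relevance, k):
    """Greedily pick up to k frames maximizing det of the selected sub-kernel."""
    n = len(features)
    # L[i][j] = r_i * S_ij * r_j: the usual quality-diversity DPP kernel.
    L = [[relevance[i] * gaussian_kernel(features[i], features[j]) * relevance[j]
          for j in range(n)] for i in range(n)]
    selected = []
    for _ in range(min(k, n)):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = det([[L[a][b] for b in idx] for a in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:
            break
        selected.append(best)
    return sorted(selected)  # return frames in temporal order
```

With two near-duplicate frames and one distinct frame, the determinant penalizes the duplicate pair, so the sketch keeps the most relevant frame plus the distinct one even though the duplicate has higher relevance.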

Authors:Xue Han, Yongpeng Wu, Zhen Gao, Biqian Feng, Yuxuan Shi, Deniz Gündüz, Wenjun Zhang
Title: SCSC: A Novel Standards-Compatible Semantic Communication Framework for Image Transmission
Abstract:
Joint source-channel coding (JSCC) is a promising paradigm for next-generation communication systems, particularly in challenging transmission environments. In this paper, we propose a novel standard-compatible JSCC framework for the transmission of images over multiple-input multiple-output (MIMO) channels. Different from the existing end-to-end AI-based DeepJSCC schemes, our framework consists of learnable modules that enable communication using conventional separate source and channel codes (SSCC), which makes it amenable for easy deployment on legacy systems. Specifically, the learnable modules involve a preprocessing-empowered network (PPEN) for preserving essential semantic information, and a precoder \& combiner-enhanced network (PCEN) for efficient transmission over a resource-constrained MIMO channel. We treat existing compression and channel coding modules as non-trainable blocks. Since the parameters of these modules are non-differentiable, we employ a proxy network that mimics their operations when training the learnable modules. Numerical results demonstrate that our scheme can save more than 29\% of the channel bandwidth, and requires lower complexity compared to the constrained baselines. We also show its generalization capability to unseen datasets and tasks through extensive experiments.
中文: 本文提出了一种新颖的标准兼容联合信源信道编码框架,用于多输入多输出信道上的图像传输,该框架通过可学习模块在保持与传统系统兼容的同时,有效保留语义信息并提升传输效率。
English: This paper introduces a novel standard-compatible joint source-channel coding framework for image transmission over MIMO channels, which employs learnable modules to preserve semantic information and enhance transmission efficiency while maintaining compatibility with legacy systems.

Authors:Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang
Title: DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
Abstract:
Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
Chinese: 提出的DepthMaster模型通过特征对齐模块增强语义表示和傅里叶增强模块保留细节,在单步扩散框架中实现了卓越的泛化能力和推理效率,显著提升了单目深度估计性能。
English: The proposed DepthMaster model enhances monocular depth estimation by integrating a Feature Alignment module for semantic enrichment and a Fourier Enhancement module for detail preservation, achieving superior generalization and inference efficiency in a single-step diffusion framework.

Authors:Wenyan Cong, Hanqing Zhu, Kevin Wang, Jiahui Lei, Colton Stearns, Yuanhao Cai, Leonidas Guibas, Zhangyang Wang, Zhiwen Fan
Title: VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment
Abstract:
Efficiently reconstructing 3D scenes from monocular video remains a core challenge in computer vision, vital for applications in virtual reality, robotics, and scene understanding. Recently, frame-by-frame progressive reconstruction without camera poses is commonly adopted, incurring high computational overhead and compounding errors when scaling to longer videos. To overcome these issues, we introduce VideoLifter, a novel video-to-3D pipeline that leverages a local-to-global strategy on a fragment basis, achieving both extreme efficiency and SOTA quality. Locally, VideoLifter leverages learnable 3D priors to register fragments, extracting essential information for subsequent 3D Gaussian initialization with enforced inter-fragment consistency and optimized efficiency. Globally, it employs a tree-based hierarchical merging method with key frame guidance for inter-fragment alignment, pairwise merging with Gaussian point pruning, and subsequent joint optimization to ensure global consistency while efficiently mitigating cumulative errors. This approach significantly accelerates the reconstruction process, reducing training time by over 82% while holding better visual quality than current SOTA methods.
Chinese: VideoLifter提出了一种新颖的局部到全局视频转3D流程,通过基于片段的配准和分层融合策略,在实现82%以上训练加速的同时,获得了超越现有最优方法的视觉质量。
English: VideoLifter introduces a novel local-to-global pipeline for efficient 3D scene reconstruction from monocular video, reducing training time by over 82% with better visual quality than current SOTA methods by leveraging fragment-based registration and hierarchical merging.

Authors:Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
Title: CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Abstract:
With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 25 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.
Chinese: 为解决现有基准在评估高级编程能力上的不足,CodeElo作为一个基于CodeForces的标准化竞赛级代码生成基准被提出,其独特的评判方法和Elo评分系统显示o1-mini和QwQ-32B-Preview表现突出,而其他模型则面临挑战。
English: To address the limitations of existing benchmarks in evaluating advanced coding abilities, CodeElo is introduced as a standardized competition-level code generation benchmark based on CodeForces, featuring a unique judging method and Elo rating system that reveals top-performing models like o1-mini and QwQ-32B-Preview while highlighting the struggles of others.
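
For readers unfamiliar with Elo ratings, the textbook Elo expected-score and update formulas can be sketched as below. This is the classical formulation only; the abstract does not specify CodeElo's rating calculation, which is aligned with the CodeForces platform, so the function names and the `k` factor here are illustrative assumptions.

```python
def expected_score(rating_a, rating_b):
    """Classical Elo probability that A outperforms B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating, expected, actual, k=32.0):
    """Move the rating toward the observed result (actual is 1, 0.5, or 0)."""
    return rating + k * (actual - expected)
```

Under this formula, a 400-point rating gap corresponds to roughly 10-to-1 odds in favor of the higher-rated participant.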

Authors:Hao Wang, Xiwen Chen, Ashish Bastola, Jiayou Qin, Abolfazl Razi
Title: Diffusion Prism: Enhancing Diversity and Morphology Consistency in Mask-to-Image Diffusion
Abstract:
The emergence of generative AI and controllable diffusion has made image-to-image synthesis increasingly practical and efficient. However, when input images are sparse and exhibit low entropy, the inherent characteristics of diffusion models often result in limited diversity. This constraint significantly interferes with data augmentation. To address this, we propose Diffusion Prism, a training-free framework that efficiently transforms binary masks into realistic and diverse samples while preserving morphological features. We found that a small amount of artificial noise significantly assists the image-denoising process. To validate this novel mask-to-image concept, we use nano-dendritic patterns as an example to demonstrate the merit of our method compared to existing controllable diffusion models. Furthermore, we extend the proposed framework to other biological patterns, highlighting its potential applications across various fields.
中文摘要:本研究提出Diffusion Prism框架,通过引入微量人工噪声有效提升低熵输入图像的多样性,在保持形态特征的同时将二元掩码转化为逼真样本,为数据增强等应用提供了新方案。
English Summary: The study introduces Diffusion Prism, a training-free framework that enhances image diversity in low-entropy inputs by incorporating minimal artificial noise, effectively transforming binary masks into realistic samples while preserving morphological features for applications like data augmentation.

Authors:Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Ke Xu, Quyang Pan, Bo Gao, Tian Wen
Title: Beyond Model Scale Limits: End-Edge-Cloud Federated Learning with Self-Rectified Knowledge Agglomeration
Abstract:
The rise of End-Edge-Cloud Collaboration (EECC) offers a promising paradigm for Artificial Intelligence (AI) model training across end devices, edge servers, and cloud data centers, providing enhanced reliability and reduced latency. Hierarchical Federated Learning (HFL) can benefit from this paradigm by enabling multi-tier model aggregation across distributed computing nodes. However, the potential of HFL is significantly constrained by the inherent heterogeneity and dynamic characteristics of EECC environments. Specifically, the uniform model structure bounded by the least powerful end device across all computing nodes imposes a performance bottleneck. Meanwhile, coupled heterogeneity in data distributions and resource capabilities across tiers disrupts hierarchical knowledge transfer, leading to biased updates and degraded performance. Furthermore, the mobility and fluctuating connectivity of computing nodes in EECC environments introduce complexities in dynamic node migration, further compromising the robustness of the training process. To address multiple challenges within a unified framework, we propose End-Edge-Cloud Federated Learning with Self-Rectified Knowledge Agglomeration (FedEEC), which is a novel EECC-empowered FL framework that allows the trained models from end, edge, to cloud to grow larger in size and stronger in generalization ability. FedEEC introduces two key innovations: (1) Bridge Sample Based Online Distillation Protocol (BSBODP), which enables knowledge transfer between neighboring nodes through generated bridge samples, and (2) Self-Knowledge Rectification (SKR), which refines the transferred knowledge to prevent suboptimal cloud model optimization. The proposed framework effectively handles both cross-tier resource heterogeneity and effective knowledge transfer between neighboring nodes, while satisfying the migration-resilient requirements of EECC.
中文摘要:端边云协同范式虽能提升分层联邦学习的性能,却受限于系统异构性和动态环境;FedEEC框架通过桥接样本蒸馏和自知识校正技术,有效解决了跨层资源差异和知识迁移问题,实现了可扩展且鲁棒的AI训练。
English Summary: The End-Edge-Cloud Collaboration paradigm enhances Hierarchical Federated Learning but faces challenges from system heterogeneity and dynamic environments, which the proposed FedEEC framework addresses through bridge sample distillation and self-knowledge rectification to enable scalable and robust AI training.

Authors:Kunpeng Zhang, Zongjie Li, Daoyuan Wu, Shuai Wang, Xin Xia
Title: Low-Cost and Comprehensive Non-textual Input Fuzzing with LLM-Synthesized Input Generators
Abstract:
Modern software often accepts inputs with highly complex grammars. Recent advances in large language models (LLMs) have shown that they can be used to synthesize high-quality natural language text and code that conforms to the grammar of a given input format. Nevertheless, LLMs are often unable to generate non-textual outputs, such as images, videos, and PDF files, or can do so only at prohibitive cost. This limitation hinders the application of LLMs in grammar-aware fuzzing. We present a novel approach to enabling grammar-aware fuzzing over non-textual inputs. We employ LLMs to synthesize and also mutate input generators, in the form of Python scripts, that generate data conforming to the grammar of a given input format. Then, non-textual data yielded by the input generators are further mutated by traditional fuzzers (AFL++) to explore the software input space effectively. Our approach, namely G2FUZZ, features a hybrid strategy that combines a holistic search driven by LLMs and a local search driven by industrial quality fuzzers. Two key advantages are: (1) LLMs are good at synthesizing and mutating input generators and enabling jumping out of local optima, thus achieving a synergistic effect when combined with mutation-based fuzzers; (2) LLMs are invoked only when really needed, thus significantly reducing the cost of LLM usage. We have evaluated G2FUZZ on a variety of input formats, including TIFF images, MP4 audio, and PDF files. The results show that G2FUZZ outperforms SOTA tools such as AFL++, Fuzztruction, and FormatFuzzer in terms of code coverage and bug finding across most programs tested on three platforms: UNIFUZZ, FuzzBench, and MAGMA.
Chinese: G2FUZZ提出了一种混合模糊测试方法,利用大语言模型生成和变异非文本格式的输入生成器,并结合传统模糊测试器,在降低大语言模型使用成本的同时,显著提高了代码覆盖率和漏洞发现能力。
English: G2FUZZ introduces a hybrid fuzzing approach that uses LLMs to create and mutate input generators for non-textual formats, combining them with traditional fuzzers to enhance code coverage and bug detection while reducing LLM usage costs.

Authors:Minh Le, Anh Nguyen, Huy Nguyen, Chau Nguyen, Anh Tran, Nhat Ho
Title: On the Expressiveness of Visual Prompt Experts
Abstract:
Visual Prompt Tuning (VPT) has proven effective for parameter-efficient adaptation of pre-trained vision models to downstream tasks by inserting task-specific learnable prompt tokens. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on the recently established connection between Mixture of Experts (MoE) and prompt-based methods, wherein each attention head can be conceptualized as a composition of multiple MoE models, we reinterpret VPT as the introduction of new prompt experts into these MoE structures. We identify a key limitation in existing VPT frameworks: the restricted functional expressiveness of prompt experts, which remain static and thus limited in their adaptability. To address this, we propose Visual Adaptive Prompt Tuning (VAPT), a novel method that endows prompt experts with enhanced expressiveness while preserving parameter efficiency. Empirical evaluations on VTAB-1K and FGVC demonstrate that VAPT achieves substantial performance improvements, surpassing fully fine-tuned baselines by 7.34% and 1.04%, respectively. Moreover, VAPT consistently outperforms VPT while requiring fewer additional parameters. Furthermore, our theoretical analysis indicates that VAPT achieves optimal sample efficiency. Collectively, these results underscore the theoretical grounding and empirical advantages of our approach.
Chinese: 视觉自适应提示调优(VAPT)通过增强预训练视觉模型中提示专家的表达能力,以更少的参数实现了优于现有方法的性能和样本效率。
English: Visual Adaptive Prompt Tuning (VAPT) enhances the expressiveness of prompt experts in pre-trained vision models, achieving superior performance and sample efficiency with fewer parameters than existing methods.

Authors:Wensheng Gan, Zhenyao Ning, Zhenlian Qi, Philip S. Yu
Title: Mixture of Experts (MoE): A Big Data Perspective
Abstract:
As the era of big data arrives, traditional artificial intelligence algorithms have difficulty processing the demands of massive and diverse data. Mixture of experts (MoE) has shown excellent performance and broad application prospects. This paper provides an in-depth review and analysis of the latest progress in this field from multiple perspectives, including the basic principles, algorithmic models, key technical challenges, and application practices of MoE. First, we introduce the basic concept of MoE and its core idea and elaborate on its advantages over traditional single models. Then, we discuss the basic architecture of MoE and its main components, including the gating network, expert networks, and learning algorithms. Next, we review the applications of MoE in addressing key technical issues in big data. For each challenge, we provide specific MoE solutions and their innovations. Furthermore, we summarize the typical use cases of MoE in various application domains. This fully demonstrates the powerful capability of MoE in big data processing. We also analyze the advantages of MoE in big data environments. Finally, we explore the future development trends of MoE. We believe that MoE will become an important paradigm of artificial intelligence in the era of big data. In summary, this paper systematically elaborates on the principles, techniques, and applications of MoE in big data processing, providing theoretical and practical references to further promote the application of MoE in real scenarios.
中文: 本文全面综述了专家混合模型(MoE),通过其架构、技术方案和多样化应用展示了其在大数据处理中的卓越性能,并预测其将成为未来人工智能的重要范式。
English: This paper comprehensively reviews the mixture of experts (MoE) model, highlighting its superior performance in handling big data challenges through its architecture, technical solutions, and diverse applications, and predicts its role as a key AI paradigm in the future.
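
The gating network and expert networks named in the abstract can be illustrated with a small top-k routing sketch. This is a generic illustration of sparse MoE routing, not code from the survey: the function names, the linear gate, the scalar-valued experts, and `top_k=2` are all assumptions made for brevity.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_weights, experts, top_k=2):
    """Route input x to the top_k experts chosen by a linear gate.

    gate_weights: one weight row per expert (hypothetical linear gating network).
    experts: callables mapping the input vector to a scalar output.
    Returns the gate-weighted output and the indices of the chosen experts.
    """
    # One gate logit per expert: dot product of that expert's gate row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the top_k experts; the rest are never evaluated (sparse routing).
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate scores over the chosen experts only.
    weights = softmax([logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen
```

Because only the selected experts are evaluated, total parameter count can grow with the number of experts while per-input compute stays roughly constant, which is the usual argument for MoE at scale.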

Authors:Sami Baral, Li Lucy, Ryan Knight, Alice Ng, Luca Soldaini, Neil T. Heffernan, Kyle Lo
Title: DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
Abstract:
In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, as well as 44,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions. We also find that synthetic QAs, though imperfect, can yield similar model rankings as teacher-written QAs. We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind.
中文摘要:DrawEduMath数据集评估了视觉语言模型分析教育数学图像的能力,揭示了模型性能存在显著不足,并证明合成问答对能像教师标注一样有效评估模型表现。
English Summary: The DrawEduMath dataset evaluates vision language models' ability to analyze educational math images, revealing significant performance gaps and demonstrating that synthetic question-answer pairs can effectively rank models similarly to teacher-annotated ones.

Authors:Zhi Zhou, Hao-Zhe Tan, Peng-Xiao Song, Lan-Zhe Guo
Title: CGI: Identifying Conditional Generative Models with Example Images
Abstract:
Generative models have achieved remarkable performance recently, and thus model hubs have emerged. Existing model hubs typically assume basic text matching is sufficient to search for models. However, in reality, due to different abstractions and the large number of models in model hubs, it is not easy for users to review model descriptions and example images and choose the model that best meets their needs. Therefore, it is necessary to describe model functionality wisely so that future users can efficiently search for the most suitable model for their needs. Efforts to address this issue remain limited. In this paper, we propose Conditional Generative Model Identification (CGI), which aims to provide an effective way to identify the most suitable model using user-provided example images rather than requiring users to manually review a large number of models with example images. To address this problem, we propose PromptBased Model Identification (PMI), which can adequately describe model functionality and precisely match requirements with specifications. To evaluate the PMI approach and promote related research, we provide a benchmark comprising 65 models and 9100 identification tasks. Extensive experimental and human evaluation results demonstrate that PMI is effective. For instance, 92% of models are correctly identified with significantly better FID scores when four example images are provided.
中文: 生成模型库因模型抽象多样和数量庞大而难以高效筛选,为此提出基于提示的模型识别方法(PMI),通过用户提供的示例图像精准匹配需求与模型功能,实验证明该方法效果显著。
English: Generative model hubs face challenges in efficient model selection due to diverse abstractions and large quantities, prompting the proposal of PromptBased Model Identification (PMI) to accurately match user requirements with model functionality using example images, which proves highly effective in evaluations.

Authors:Lianrui Zuo, Xin Yu, Dingjie Su, Kaiwen Xu, Aravind R. Krishnan, Yihao Liu, Shunxing Bao, Fabien Maldonado, Luigi Ferrucci, Bennett A. Landman
Title: Robust Body Composition Analysis by Generating 3D CT Volumes from Limited 2D Slices
Abstract:
Body composition analysis provides valuable insights into aging, disease progression, and overall health conditions. Due to concerns of radiation exposure, two-dimensional (2D) single-slice computed tomography (CT) imaging has been used repeatedly for body composition analysis. However, this approach introduces significant spatial variability that can impact the accuracy and robustness of the analysis. To mitigate this issue and facilitate body composition analysis, this paper presents a novel method to generate 3D CT volumes from a limited number of 2D slices using a latent diffusion model (LDM). Our approach first maps 2D slices into a latent representation space using a variational autoencoder. An LDM is then trained to capture the 3D context of a stack of these latent representations. To accurately interpolate intermediate slices and construct a full 3D volume, we utilize body part regression to determine the spatial location and distance between the acquired slices. Experiments on both in-house and public 3D abdominal CT datasets demonstrate that the proposed method significantly enhances body composition analysis compared to traditional 2D-based analysis, with a reduced error rate from 23.3% to 15.2%.
中文: 本文提出了一种利用潜在扩散模型从有限二维切片生成三维CT图像的新方法,将体成分分析的误差率从23.3%降至15.2%,显著提升了分析准确性。
English: This paper introduces a novel method using a latent diffusion model to generate 3D CT volumes from limited 2D slices, significantly improving body composition analysis accuracy by reducing the error rate from 23.3% to 15.2%.

Authors:Lianrui Zuo, Kaiwen Xu, Dingjie Su, Xin Yu, Aravind R. Krishnan, Yihao Liu, Shunxing Bao, Thomas Li, Kim L. Sandler, Fabien Maldonado, Bennett A. Landman
Title: Beyond the Lungs: Extending the Field of View in Chest CT with Latent Diffusion Models
Abstract:
The interconnection between the human lungs and other organs, such as the liver and kidneys, is crucial for understanding the underlying risks and effects of lung diseases and improving patient care. However, most research chest CT imaging is focused solely on the lungs due to considerations of cost and radiation dose. This restricted field of view (FOV) in the acquired images poses challenges to comprehensive analysis and hinders the ability to gain insights into the impact of lung diseases on other organs. To address this, we propose SCOPE (Spatial Coverage Optimization with Prior Encoding), a novel approach to capture the inter-organ relationships from CT images and extend the FOV of chest CT images. Our approach first trains a variational autoencoder (VAE) to encode 2D axial CT slices individually, then stacks the latent representations of the VAE to form a 3D context for training a latent diffusion model. Once trained, our approach extends the FOV of CT images in the z-direction by generating new axial slices in a zero-shot manner. We evaluated our approach on the National Lung Screening Trial (NLST) dataset, and results suggest that it effectively extends the FOV to include the liver and kidneys, which are not completely covered in the original NLST data acquisition. Quantitative results on a held-out whole-body dataset demonstrate that the generated slices exhibit high fidelity with acquired data, achieving an SSIM of 0.81.
中文: 提出的SCOPE方法通过潜在扩散模型生成额外轴向切片,扩展了胸部CT扫描的视野范围,从而能够纳入肝脏和肾脏等器官,以更好地研究肺部疾病中的器官间关联。
English: The proposed SCOPE method extends the field of view in chest CT scans by generating additional axial slices using a latent diffusion model, enabling the inclusion of organs like the liver and kidneys to better study inter-organ relationships in lung diseases.

Authors:Dunwei Tu, Huiyu Yi, Yuchi Wang, Baile Xu, Jian Zhao, Furao Shen
Title: Multiple Queries with Multiple Keys: A Precise Prompt Matching Paradigm for Prompt-based Continual Learning
Abstract:
Continual learning requires machine learning models to continuously acquire new knowledge in dynamic environments while avoiding the forgetting of previous knowledge. Prompt-based continual learning methods effectively address the issue of catastrophic forgetting through prompt expansion and selection. However, existing approaches often suffer from low accuracy in prompt selection, which can result in the model receiving biased knowledge and making biased predictions. To address this issue, we propose the Multiple Queries with Multiple Keys (MQMK) prompt matching paradigm for precise prompt selection. The goal of MQMK is to select the prompts whose training data distribution most closely matches that of the test sample. Specifically, Multiple Queries enable precise breadth search by introducing task-specific knowledge, while Multiple Keys perform deep search by representing the feature distribution of training samples at a fine-grained level. Experiments show that MQMK enhances the prompt matching rate by over 30% in challenging scenarios and achieves state-of-the-art performance on three widely adopted continual learning benchmarks. Once this paper is accepted, we will release the code.
Chinese: 提出的多查询多密钥(MQMK)范式通过实现精确的广度和深度搜索,显著提升了持续学习中的提示选择准确性,在基准测试中实现了超过30%的匹配率提升和最先进的性能。
English: The proposed Multiple Queries with Multiple Keys (MQMK) paradigm significantly improves prompt selection accuracy in continual learning by enabling precise breadth and depth searches, achieving over 30% higher matching rates and state-of-the-art performance on benchmarks.
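
The matching idea in the abstract (one task-specific query per candidate prompt, several keys per prompt) might be sketched as follows. This is a rough sketch under stated assumptions, not the authors' implementation: the abstract does not give the scoring rule, so cosine similarity, the max over keys, and the names `select_prompt`, `query_fns`, and `keys_per_task` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_prompt(sample, query_fns, keys_per_task):
    """Pick the task whose prompt best matches the test sample.

    query_fns: task -> callable producing a task-specific query for the sample
               (the "multiple queries"; assumed to be any feature extractor).
    keys_per_task: task -> list of key vectors describing that task's training
               distribution at a fine-grained level (the "multiple keys").
    """
    best_task, best_score = None, -math.inf
    for task, query_fn in query_fns.items():
        query = query_fn(sample)
        # Assumed scoring rule: the best-matching key represents the task.
        score = max(cosine(query, key) for key in keys_per_task[task])
        if score > best_score:
            best_task, best_score = task, score
    return best_task, best_score
```

Using several keys per task lets one task cover a multi-modal feature distribution, where a single averaged key could sit far from every actual cluster.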

Authors:Holger Boche, Christian Deppe, Safieh Mahmoodi, Golamreza Omidi
Title: Galaxy Codes: Advancing Achievability for Deterministic Identification via Gaussian Channels
Abstract:
Deterministic identification offers an efficient solution for scenarios where decoding entire messages is unnecessary. It is commonly used in alarm systems and control systems. A key advantage of this approach is that the capacity for deterministic identification in Gaussian channels with power constraints grows superexponentially, unlike Shannon's transmission capacity. This allows for a significantly higher number of messages to be transmitted using this event-driven method. So far, only upper and lower bounds for deterministic identification capacity have been established. Our work introduces a novel construction: galaxy codes for deterministic identification. Using these codes, we demonstrate an improvement in the achievability bound from 1/4 to 3/8, representing a previously unknown advance that opens new possibilities for efficient communication.
Chinese: 确定性识别通过事件驱动方法在高斯信道中实现远超传统方式的超指数级消息容量,我们新提出的星系码将可达界从1/4提升至3/8,为高效通信开辟了新途径。
English: Deterministic identification enables efficient event-driven communication by allowing superexponentially more messages in Gaussian channels than traditional methods, with our newly developed galaxy codes improving the achievability bound from 1/4 to 3/8.

Authors:Benjamin Kiefer, Lojze Žust, Jon Muhovič, Matej Kristan, Janez Perš, Matija Teršek, Uma Mudenagudi, Chaitra Desai, Arnold Wiliem, Marten Kreis, Nikhil Akalwadi, Yitong Quan, Zhiqiang Zhong, Zhe Zhang, Sujie Liu, Xuran Chen, Yang Yang, Matej Fabijanić, Fausto Ferreira, Seongju Lee, Junseok Lee, Kyoobin Lee, Shanliang Yao, Runwei Guan, Xiaoyu Huang, Yi Ni, Himanshu Kumar, Yuan Feng, Yi-Ching Cheng, Tzu-Yu Lin, Chia-Ming Lee, Chih-Chung Hsu, Jannik Sheikh, Andreas Michel, Wolfgang Gross, Martin Weinmann, Josip Šarić, Yipeng Lin, Xiang Yang, Nan Jiang, Yutang Lu, Fei Feng, Ali Awad, Evan Lucas, Ashraf Saleem, Ching-Heng Cheng, Yu-Fan Lin, Tzu-Yu Lin, Chih-Chung Hsu
Title: 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results
Abstract:
The 3rd Workshop on Maritime Computer Vision (MaCVi) 2025 addresses maritime computer vision for Unmanned Surface Vehicles (USV) and underwater. This report offers a comprehensive overview of the findings from the challenges. We provide both statistical and qualitative analyses, evaluating trends from over 700 submissions. All datasets, evaluation code, and the leaderboard are available to the public at https://macvi.org/workshop/macvi25.
中文: 第三届海上计算机视觉研讨会(MaCVi 2025)致力于推进无人水面艇和水下计算机视觉技术,通过对700多份提交成果的统计与定性分析,全面总结了挑战赛的研究发现。
English: The 3rd Workshop on Maritime Computer Vision (MaCVi) 2025 focuses on advancing maritime computer vision for Unmanned Surface Vehicles and underwater applications, analyzing over 700 submissions through statistical and qualitative methods.

Authors:Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger
Title: TimeFlow: Temporal Conditioning for Longitudinal Brain MRI Registration and Aging Analysis
Abstract:
Longitudinal brain analysis is essential for understanding healthy aging and identifying pathological deviations. Longitudinal registration of sequential brain MRI underpins such analyses. However, existing methods are limited by reliance on densely sampled time series, a trade-off between accuracy and temporal smoothness, and an inability to prospectively forecast future brain states. To overcome these challenges, we introduce TimeFlow, a learning-based framework for longitudinal brain MRI registration. TimeFlow uses a U-Net backbone with temporal conditioning to model neuroanatomy as a continuous function of age. Given only two scans from an individual, TimeFlow estimates accurate and temporally coherent deformation fields, enabling non-linear extrapolation to predict future brain states. This is achieved by our proposed inter-/extra-polation consistency constraints applied to both the deformation fields and deformed images. Remarkably, these constraints preserve temporal consistency and continuity without requiring explicit smoothness regularizers or densely sampled sequential data. Extensive experiments demonstrate that TimeFlow outperforms state-of-the-art methods in terms of both future timepoint forecasting and registration accuracy. Moreover, TimeFlow supports novel biological brain aging analyses by differentiating neurodegenerative trajectories from normal aging without requiring segmentation, thereby eliminating the need for labor-intensive annotations and mitigating segmentation inconsistency. TimeFlow offers an accurate, data-efficient, and annotation-free framework for longitudinal analysis of brain aging and chronic diseases, capable of forecasting brain changes beyond the observed study period.
中文摘要:TimeFlow是一种创新的学习型框架,仅需两次脑部MRI扫描即可实现精准配准并预测未来脑状态,在预测精度和配准效果上均优于现有方法,且无需密集时间序列数据或人工标注。
English Summary: TimeFlow is a novel learning-based framework that accurately registers and forecasts future brain states from just two MRI scans, outperforming existing methods in both predictive accuracy and registration without needing dense time series or annotations.

Authors:Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Xiaoxiao Yan, Feiyue Huang, Yong Liu
Title: RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation
Abstract:
In recent years, there have been significant advancements in deep learning for medical image analysis, especially with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies while transformers suffer high computational complexities. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed inverted residual RWKV (IR-RWKV) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on benchmark datasets, including Synapse, ACDC, BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017 and GLAS show that RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.
Chinese Summary: RWKV-UNet模型将RWKV结构与U-Net结合,增强了医学图像分割中的长距离依赖捕捉和上下文理解能力,在多个基准数据集上达到最优性能,并提供适用于临床的高效变体。
English Summary: The RWKV-UNet model integrates the RWKV structure with U-Net to enhance long-range dependency capture and contextual understanding in medical image segmentation, achieving state-of-the-art performance across multiple benchmark datasets while offering efficient variants for clinical use.

Authors:Chia-Ming Lee, Yu-Fan Lin, Li-Wei Kang, Chih-Chung Hsu
Title: Robust Hyperspectral Image Pansharpening via Sparse Spatial-Spectral Representation
Abstract:
High-resolution hyperspectral imaging plays a crucial role in various remote sensing applications, yet its acquisition often faces fundamental limitations due to hardware constraints. This paper introduces S$^{3}$RNet, a novel framework for hyperspectral image pansharpening that effectively combines low-resolution hyperspectral images (LRHSI) with high-resolution multispectral images (HRMSI) through sparse spatial-spectral representation. The core of S$^{3}$RNet is the Multi-Branch Fusion Network (MBFN), which employs parallel branches to capture complementary features at different spatial and spectral scales. Unlike traditional approaches that treat all features equally, our Spatial-Spectral Attention Weight Block (SSAWB) dynamically adjusts feature weights to maintain sparse representation while suppressing noise and redundancy. To enhance feature propagation, we incorporate the Dense Feature Aggregation Block (DFAB), which efficiently aggregates input features through dense connectivity patterns. This integrated design enables S$^{3}$RNet to selectively emphasize the most informative features from different scales while maintaining computational efficiency. Comprehensive experiments demonstrate that S$^{3}$RNet achieves state-of-the-art performance across multiple evaluation metrics, showing particular strength in maintaining high reconstruction quality even under challenging noise conditions. The code will be made publicly available.
中文: 本文提出S$^{3}$RNet框架,通过稀疏空谱表示有效融合低分辨率高光谱与高分辨率多光谱图像,在保持计算效率的同时实现了最优性能,尤其在噪声环境下展现出卓越的重建质量。
English: This paper presents S$^{3}$RNet, a novel hyperspectral image pansharpening framework that effectively fuses low-resolution hyperspectral and high-resolution multispectral images through sparse spatial-spectral representation, achieving state-of-the-art performance with enhanced noise robustness.

Authors:Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Ming Yang, Lan-Zhe Guo
Title: BMIP: Bi-directional Modality Interaction Prompt Learning for VLM
Abstract:
Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
中文: 提出的双向模态交互提示(BMIP)方法通过动态加权双模态信息增强了视觉语言模型的适应能力,在包括新提出的开放世界泛化测试在内的多种评估范式下均优于现有方法。
English: The proposed Bi-directional Modality Interaction Prompt (BMIP) method enhances vision-language model adaptation by dynamically weighting bi-modal information, outperforming existing approaches across multiple evaluation paradigms including a newly introduced open-world generalization test.

Authors:Chongming Gao, Kexin Huang, Ziang Fei, Jiaju Chen, Jiawei Chen, Jianshan Sun, Shuchang Liu, Qingpeng Cai, Peng Jiang
Title: Future-Conditioned Recommendations with Multi-Objective Controllable Decision Transformer
Abstract:
Securing long-term success is the ultimate aim of recommender systems, demanding strategies capable of foreseeing and shaping the impact of decisions on future user satisfaction. Current recommendation strategies grapple with two significant hurdles. Firstly, the future impacts of recommendation decisions remain obscured, rendering it impractical to evaluate them through direct optimization of immediate metrics. Secondly, conflicts often emerge between multiple objectives, like enhancing accuracy versus exploring diverse recommendations. Existing strategies, trapped in a "training, evaluation, and retraining" loop, grow more labor-intensive as objectives evolve. To address these challenges, we introduce a future-conditioned strategy for multi-objective controllable recommendations, allowing for the direct specification of future objectives and empowering the model to generate item sequences that align with these goals autoregressively. We present the Multi-Objective Controllable Decision Transformer (MocDT), an offline Reinforcement Learning (RL) model capable of autonomously learning the mapping from multiple objectives to item sequences, leveraging extensive offline data. Consequently, it can produce recommendations tailored to any specified objectives during the inference stage. Our empirical findings emphasize the controllable recommendation strategy's ability to produce item sequences according to different objectives while maintaining performance that is competitive with current recommendation strategies across various objectives.
Chinese: 本文提出了一种面向未来的多目标可控推荐策略,通过多目标可控决策变换器(MocDT)能够直接指定未来目标并自主生成符合这些目标的物品序列,从而克服现有推荐系统在长期效果预测和多目标平衡方面的局限性。
English: This paper introduces a future-conditioned multi-objective controllable recommendation strategy that overcomes current limitations by enabling direct specification of future objectives and autonomously generating item sequences aligned with these goals through the Multi-Objective Controllable Decision Transformer (MocDT).
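
As a rough illustration of what conditioning on future objectives can look like, the sketch below computes per-objective returns-to-go, the quantity a standard Decision Transformer is conditioned on. The abstract does not spell out MocDT's exact conditioning signal, so the function name `returns_to_go`, the two objectives, and the reward values are all assumptions made for illustration.

```python
from typing import List, Sequence


def returns_to_go(rewards: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per-objective suffix sums: out[t][k] = sum of rewards[s][k] for s >= t."""
    if not rewards:
        return []
    running = [0.0] * len(rewards[0])
    out: List[List[float]] = []
    # Walk the trajectory backwards so each step adds its own reward vector once.
    for step in reversed(rewards):
        running = [acc + r for acc, r in zip(running, step)]
        out.append(list(running))
    out.reverse()
    return out


# Hypothetical logged session: three recommendations, objectives = (accuracy, diversity).
rewards = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rtg = returns_to_go(rewards)
```

At inference time, a model of this kind would be prompted with a desired target vector in place of the logged `rtg[0]`; how MocDT encodes that target is not stated in the abstract.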

Authors:Jiayang Wu, Wensheng Gan, Huashen Lu, Philip S. Yu
Title: Graph Contrastive Learning on Multi-label Classification for Recommendations
Abstract:
In business analysis, providing effective recommendations is essential for enhancing company profits. The utilization of graph-based structures, such as bipartite graphs, has gained popularity for their ability to analyze complex data relationships. Link prediction is crucial for recommending specific items to users. Traditional methods in this area often involve identifying patterns in the graph structure or using representational techniques like graph neural networks (GNNs). However, these approaches encounter difficulties as the volume of data increases. To address these challenges, we propose a model called Graph Contrastive Learning for Multi-label Classification (MCGCL). MCGCL leverages contrastive learning to enhance recommendation effectiveness. The model incorporates two training stages: a main task and a subtask. The main task is holistic user-item graph learning to capture user-item relationships. The homogeneous user-user (item-item) subgraph is constructed to capture user-user and item-item relationships in the subtask. We assessed the performance using real-world datasets from Amazon Reviews in multi-label classification tasks. Comparative experiments with state-of-the-art methods confirm the effectiveness of MCGCL, highlighting its potential for improving recommendation systems.
中文: 提出的MCGCL模型通过对比学习和双训练阶段提升推荐系统中的链接预测效果,在真实数据集上的实验证明其性能优于现有先进方法。
English: The proposed MCGCL model uses contrastive learning with dual training stages to enhance link prediction in recommendation systems, demonstrating superior performance on real-world datasets compared to existing methods.
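
The abstract says only that MCGCL "leverages contrastive learning"; it does not name the loss. As a minimal sketch under that caveat, the block below implements the widely used InfoNCE objective on plain Python lists. The function names, the temperature value, and the toy embeddings are assumptions, not the authors' formulation.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature: float = 0.5) -> float:
    """InfoNCE loss: -log softmax of the positive pair among all candidate pairs."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings (hypothetical): the loss is low when the positive is aligned.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a two-view graph setup, `anchor` and `positive` would typically be the embeddings of the same user or item under the two views, with other nodes serving as negatives.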

Authors:Zhen Hong, Bowen Wang, Haoran Duan, Yawen Huang, Xiong Li, Zhenyu Wen, Xiang Wu, Wei Xiang, Yefeng Zheng
Title: SP-SLAM: Neural Real-Time Dense SLAM With Scene Priors
Abstract:
Neural implicit representations have recently shown promising progress in dense Simultaneous Localization And Mapping (SLAM). However, existing works have shortcomings in terms of reconstruction quality and real-time performance, mainly due to an inflexible scene representation strategy that does not leverage any prior information. In this paper, we introduce SP-SLAM, a novel neural RGB-D SLAM system that performs tracking and mapping in real time. SP-SLAM computes depth images and establishes sparse voxel-encoded scene priors near the surfaces to achieve rapid convergence of the model. Subsequently, the encoding voxels computed from single-frame depth images are fused into a global volume, which facilitates high-fidelity surface reconstruction. Simultaneously, we employ tri-planes to store scene appearance information, striking a balance between achieving high-quality geometric texture mapping and minimizing memory consumption. Furthermore, in SP-SLAM, we introduce an effective optimization strategy for mapping, allowing the system to continuously optimize the poses of all historical input frames during runtime without increasing computational overhead. We conduct extensive evaluations on five benchmark datasets (Replica, ScanNet, TUM RGB-D, Synthetic RGB-D, 7-Scenes). The results demonstrate that, compared to existing methods, we achieve superior tracking accuracy and reconstruction quality, while running at a significantly faster speed.
Chinese: SP-SLAM是一种新型神经RGB-D SLAM系统,通过稀疏体素编码场景先验和三平面外观存储技术,在实现实时追踪的同时完成高保真重建,在精度与速度上均超越现有方法。
English: SP-SLAM is a novel neural RGB-D SLAM system that achieves real-time tracking and high-fidelity reconstruction through sparse voxel-encoded scene priors and tri-plane appearance storage, outperforming existing methods in both accuracy and speed.

Authors:Chia-Ming Lee, Yu-Fan Lin, Yu-Hao Ho, Li-Wei Kang, Chih-Chung Hsu
Title: HyFusion: Enhanced Reception Field Transformer for Hyperspectral Image Fusion
Abstract:
Hyperspectral image (HSI) fusion addresses the challenge of reconstructing High-Resolution HSIs (HR-HSIs) from High-Resolution Multispectral images (HR-MSIs) and Low-Resolution HSIs (LR-HSIs), a critical task given the high costs and hardware limitations associated with acquiring high-quality HSIs. While existing methods leverage spatial and spectral relationships, they often suffer from limited receptive fields and insufficient feature utilization, leading to suboptimal performance. Furthermore, the scarcity of high-quality HSI data highlights the importance of efficient data utilization to maximize reconstruction quality. To address these issues, we propose HyFusion, a novel Dual-Coupled Network (DCN) framework designed to enhance cross-domain feature extraction and enable effective feature map reusing. The framework first processes HR-MSI and LR-HSI inputs through specialized subnetworks that mutually enhance each other during feature extraction, preserving complementary spatial and spectral details. At its core, HyFusion utilizes an Enhanced Reception Field Block (ERFB), which combines shifting-window attention and dense connections to expand the receptive field, effectively capturing long-range dependencies while minimizing information loss. Extensive experiments demonstrate that HyFusion achieves state-of-the-art performance in HR-MSI/LR-HSI fusion, significantly improving reconstruction quality while maintaining a compact model size and computational efficiency. By integrating enhanced receptive fields and feature map reusing into a coupled network architecture, HyFusion provides a practical and effective solution for HSI fusion in resource-constrained scenarios, setting a new benchmark in hyperspectral imaging. Our code will be publicly available.
Chinese: HyFusion提出了一种双耦合网络,通过增强感受野模块和特征重用技术,有效融合高分辨率多光谱与低分辨率高光谱图像,以优异的重建性能和计算效率设定了高光谱成像新标杆。
English: HyFusion introduces a dual-coupled network with enhanced receptive field blocks and feature reuse to effectively reconstruct high-resolution hyperspectral images from multispectral and low-resolution inputs, achieving state-of-the-art performance with computational efficiency.

Authors:Michael Kölle, Johannes Tochtermann, Julian Schönberger, Gerhard Stenzel, Philipp Altmann, Claudia Linnhoff-Popien
Title: PIMAEX: Multi-Agent Exploration through Peer Incentivization
Abstract:
While exploration in single-agent reinforcement learning has been studied extensively in recent years, considerably less work has focused on its counterpart in multi-agent reinforcement learning. To address this issue, this work proposes a peer-incentivized reward function inspired by previous research on intrinsic curiosity and influence-based rewards. The \textit{PIMAEX} reward, short for Peer-Incentivized Multi-Agent Exploration, aims to improve exploration in the multi-agent setting by encouraging agents to exert influence over each other to increase the likelihood of encountering novel states. We evaluate the \textit{PIMAEX} reward in conjunction with \textit{PIMAEX-Communication}, a multi-agent training algorithm that employs a communication channel for agents to influence one another. The evaluation is conducted in the \textit{Consume/Explore} environment, a partially observable environment with deceptive rewards, specifically designed to challenge the exploration vs.\ exploitation dilemma and the credit-assignment problem. The results empirically demonstrate that agents using the \textit{PIMAEX} reward with \textit{PIMAEX-Communication} outperform those that do not.
中文: 本研究提出PIMAEX奖励函数和PIMAEX-Communication算法,通过促进智能体间的相互影响来增强多智能体探索能力,在具有欺骗性奖励的环境中实证表明该方法具有更优性能。
English: This work introduces the PIMAEX reward function and PIMAEX-Communication algorithm to enhance multi-agent exploration by promoting inter-agent influence, with empirical results showing superior performance in deceptive reward environments.

Authors:Miao Yu, Junfeng Fang, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen
Title: LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Abstract:
While safety-aligned large language models (LLMs) are increasingly used as the cornerstone for powerful systems such as multi-agent frameworks to solve complex real-world problems, they still suffer from potential adversarial queries, such as jailbreak attacks, which attempt to induce harmful content. Researching attack methods allows us to better understand the limitations of LLMs and make trade-offs between helpfulness and safety. However, existing jailbreak attacks are primarily based on opaque optimization techniques (e.g. token-level gradient descent) and heuristic search methods like LLM refinement, which fall short in terms of transparency, transferability, and computational cost. In light of these limitations, we draw inspiration from the evolution and infection processes of biological viruses and propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm, termed evolutionary jailbreak. LLM-Virus treats jailbreak attacks as both an evolutionary and transfer learning problem, utilizing LLMs as heuristic evolutionary operators to ensure high attack efficiency, transferability, and low time cost. Our experimental results on multiple safety benchmarks show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
Chinese: 尽管安全对齐的大型语言模型仍存在越狱攻击风险,但新提出的基于进化算法的LLM-Virus攻击方法在多个安全基准测试中展现出更高的攻击效率、可迁移性和优越性能。
English: Safety-aligned large language models remain vulnerable to jailbreak attacks; LLM-Virus, an attack method based on an evolutionary algorithm that uses LLMs as evolutionary operators, targets high attack efficiency, transferability, and low time cost, and achieves competitive or even superior performance on multiple safety benchmarks.

Authors:Aurora Rofena, Claudia Lucia Piccolo, Bruno Beomonte Zobel, Paolo Soda, Valerio Guarrasi
Title: Augmented Intelligence for Multimodal Virtual Biopsy in Breast Cancer Using Generative Artificial Intelligence
Abstract:
Full-Field Digital Mammography (FFDM) is the primary imaging modality for routine breast cancer screening; however, its effectiveness is limited in patients with dense breast tissue or fibrocystic conditions. Contrast-Enhanced Spectral Mammography (CESM), a second-level imaging technique, offers enhanced accuracy in tumor detection. Nonetheless, its application is restricted due to higher radiation exposure, the use of contrast agents, and limited accessibility. As a result, CESM is typically reserved for select cases, leaving many patients to rely solely on FFDM despite the superior diagnostic performance of CESM. While biopsy remains the gold standard for definitive diagnosis, it is an invasive procedure that can cause discomfort for patients. We introduce a multimodal, multi-view deep learning approach for virtual biopsy, integrating FFDM and CESM modalities in craniocaudal and mediolateral oblique views to classify lesions as malignant or benign. To address the challenge of missing CESM data, we leverage generative artificial intelligence to impute CESM images from FFDM scans. Experimental results demonstrate that incorporating the CESM modality is crucial to enhance the performance of virtual biopsy. When real CESM data is missing, synthetic CESM images proved effective, outperforming the use of FFDM alone, particularly in multimodal configurations that combine FFDM and CESM modalities. The proposed approach has the potential to improve diagnostic workflows, providing clinicians with augmented intelligence tools to improve diagnostic accuracy and patient care. Additionally, as a contribution to the research community, we publicly release the dataset used in our experiments, facilitating further advancements in this field.
中文: 通过融合全视野数字乳腺摄影和对比增强光谱乳腺摄影的多模态深度学习模型,能够显著提高虚拟活检对乳腺病变良恶性分类的准确性,在缺乏真实CESM数据时,生成的合成图像可有效替代并优于单一FFDM检测。
English: A multimodal deep learning model that integrates Full-Field Digital Mammography and Contrast-Enhanced Spectral Mammography views significantly enhances virtual biopsy accuracy for breast lesion classification, with synthetic CESM data effectively compensating when real CESM is unavailable.

Authors:Chunming He, Rihan Zhang, Fengyang Xiao, Chengyu Fang, Longxiang Tang, Yulun Zhang, Linghe Kong, Deng-Ping Fan, Kai Li, Sina Farsiu
Title: RUN: Reversible Unfolding Network for Concealed Object Segmentation
Abstract:
Existing concealed object segmentation (COS) methods frequently utilize reversible strategies to address uncertain regions. However, these approaches are typically restricted to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose the Reversible Unfolding Network (RUN), which applies reversible strategies across both mask and RGB domains through a theoretically grounded framework, enabling accurate segmentation. RUN first formulates a novel COS model by incorporating an extra residual sparsity constraint to minimize segmentation uncertainties. The iterative optimization steps of the proposed model are then unfolded into a multistage network, with each step corresponding to a stage. Each stage of RUN consists of two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non-local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion-prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, directing the network's attention to uncertain regions and mitigating false-positive and false-negative results. Extensive experiments demonstrate the superior performance of RUN and highlight the potential of unfolding-based frameworks for COS and other high-level vision tasks. We will release the code and models.
Chinese: 提出的可逆展开网络(RUN)通过在掩码和RGB双域应用可逆策略的多阶段架构来处理分割不确定性,实现了优异的隐蔽物体分割性能。
English: The proposed Reversible Unfolding Network (RUN) applies reversible strategies across both mask and RGB domains to address segmentation uncertainties through a multistage architecture, achieving superior concealed object segmentation performance.

Authors:Chan-Jan Hsu, Yi-Cheng Lin, Chia-Chun Lin, Wei-Chih Chen, Ho Lam Chung, Chen-An Li, Yi-Chang Chen, Chien-Yu Yu, Ming-Ji Lee, Chien-Cheng Chen, Ru-Heng Huang, Hung-yi Lee, Da-Shan Shiu
Title: BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights
Abstract:
We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
Chinese: BreezyVoice是一款专为台湾普通话设计的语音合成系统,通过先进模型有效解决多音字歧义问题,在常规和语码转换场景中均表现出卓越的语音生成能力。
English: BreezyVoice is a Taiwanese Mandarin TTS system that excels in polyphone disambiguation and generates highly realistic speech using advanced models, demonstrating superior performance in general and code-switched contexts.

Authors:Tuong Do, Nghia Vu, Tudor Jianu, Baoru Huang, Minh Vu, Jionglong Su, Erman Tjiputra, Quang D. Tran, Te-Chuan Chiu, Anh Nguyen
Title: FedEFM: Federated Endovascular Foundation Model with Unseen Data
Abstract:
In endovascular surgery, the precise identification of catheters and guidewires in X-ray images is essential for reducing intervention risks. However, accurately segmenting catheter and guidewire structures is challenging due to the limited availability of labeled data. Foundation models offer a promising solution by enabling the collection of similar domain data to train models whose weights can be fine-tuned for downstream tasks. Nonetheless, large-scale data collection for training is constrained by the necessity of maintaining patient privacy. This paper proposes a new method to train a foundation model in a decentralized federated learning setting for endovascular intervention. To ensure the feasibility of the training, we tackle the unseen data issue using differentiable Earth Mover's Distance within a knowledge distillation framework. Once trained, our foundation model's weights provide valuable initialization for downstream tasks, thereby enhancing task-specific performance. Intensive experiments show that our approach achieves new state-of-the-art results, contributing to advancements in endovascular intervention and robotic-assisted endovascular surgery, while addressing the critical issue of data sharing in the medical domain.
Chinese: 本文提出了一种基于知识蒸馏框架和可微地球移动距离的去中心化联邦学习方法,用于训练血管内手术中导管和导丝分割的基础模型,在保护患者隐私的同时取得了最先进的性能。
English: This paper introduces a decentralized federated learning method using a knowledge distillation framework with differentiable Earth Mover's Distance to train a foundation model for catheter and guidewire segmentation in endovascular surgery, achieving state-of-the-art results while preserving patient privacy.
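
For readers unfamiliar with the Earth Mover's Distance mentioned above, the sketch below shows its simplest case: two 1-D histograms over the same unit-spaced bins with equal total mass, where the distance reduces to the summed absolute difference of the cumulative sums. This is not the paper's differentiable formulation inside the distillation framework, which the abstract does not detail; `emd_1d` and the example histograms are assumptions for illustration.

```python
from itertools import accumulate
from typing import Sequence


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """EMD between two histograms on the same unit-spaced bins with equal mass."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    # In 1-D, the mass that must cross each bin boundary is the CDF difference there.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Moving one unit of mass across two bins costs 2.
far_apart = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
identical = emd_1d([0.5, 0.5], [0.5, 0.5])
```

Because the expression is built from sums and absolute values, it is differentiable almost everywhere, which is one reason EMD-style terms can be used as training losses.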

Authors:Yuxin Zhang, Minyan Luo, Weiming Dong, Xiao Yang, Haibin Huang, Chongyang Ma, Oliver Deussen, Tong-Yee Lee, Changsheng Xu
Title: IP-Prompter: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting
Abstract:
The stories and characters that captivate us as we grow up shape unique fantasy worlds, with images serving as the primary medium for visually experiencing these realms. Personalizing generative models through fine-tuning with theme-specific data has become a prevalent approach in text-to-image generation. However, unlike object customization, which focuses on learning specific objects, theme-specific generation encompasses diverse elements such as characters, scenes, and objects. Such diversity also introduces a key challenge: how to adaptively generate multi-character, multi-concept, and continuous theme-specific images (TSI). Moreover, fine-tuning approaches often come with significant computational overhead, time costs, and risks of overfitting. This paper explores a fundamental question: Can image generation models directly leverage images as contextual input, similarly to how large language models use text as context? To address this, we present IP-Prompter, a novel training-free TSI generation method. IP-Prompter introduces visual prompting, a mechanism that integrates reference images into generative models, allowing users to seamlessly specify the target theme without requiring additional training. To further enhance this process, we propose a Dynamic Visual Prompting (DVP) mechanism, which iteratively optimizes visual prompts to improve the accuracy and quality of generated images. Our approach enables diverse applications, including consistent story generation, character design, realistic character generation, and style-guided image generation. Comparative evaluations against state-of-the-art personalization methods demonstrate that IP-Prompter achieves significantly better results and excels in preserving character identity, style consistency, and text alignment, offering a robust and flexible solution for theme-specific image generation.
中文: 本文提出IP-Prompter这一免训练的主题图像生成方法,通过视觉提示机制直接利用参考图像生成多概念主题图像,无需微调即可在保持角色一致性和风格统一性方面显著优于现有个性化方法。
English: This paper introduces IP-Prompter, a training-free method for theme-specific image generation that uses visual prompts from reference images to create consistent multi-concept images without fine-tuning, outperforming existing personalization techniques in character identity preservation, style consistency, and text alignment.

Authors:Ke Xu, Weizhi Zhang, Zihe Song, Yuanjie Zhu, Philip S. Yu
Title: Graph Neural Controlled Differential Equations For Collaborative Filtering
Abstract:
Graph Convolution Networks (GCNs) are widely considered state-of-the-art for recommendation systems. Several studies in the field of recommendation systems have attempted to apply collaborative filtering (CF) within the Neural ODE framework. These studies follow the same idea as LightGCN, either removing the weight matrix or using a discrete weight matrix. However, we argue that weight control is critical for Neural ODE-based methods. Weights are crucial for creating a tailored graph convolution for each node, and a fixed/discrete weight cannot adjust over time within the ODE function. This rigidity in the graph convolution reduces its adaptability, consequently hindering the performance of recommendations. In this study, to create an optimal control for Neural ODE-based recommendation, we introduce a new method called Graph Neural Controlled Differential Equations for Collaborative Filtering (CDE-CF). Our method improves the performance of the Graph ODE-based method by incorporating weight control in a continuous manner. To evaluate our approach, we conducted experiments on various datasets. The results show that our method surpasses competing baselines, including GCN-based models and state-of-the-art Graph ODE-based methods.
中文: 本研究提出CDE-CF方法,通过引入连续权重控制改进基于神经ODE的推荐系统,解决了图卷积中固定权重导致的适应性问题,实验证明其性能优于现有模型。
English: This study introduces CDE-CF, a method that enhances Neural ODE-based recommendation systems by incorporating continuous weight control to address the limitations of fixed weights in graph convolutions, demonstrating superior performance over existing models.

Authors:Rui Wang, Mingxuan Xia, Chang Yao, Lei Feng, Junbo Zhao, Gang Chen, Haobo Wang
Title: Towards Robust Incremental Learning under Ambiguous Supervision
Abstract:
Traditional Incremental Learning (IL) aims to handle sequential fully-supervised learning problems where novel classes emerge from time to time. However, due to inherent annotation uncertainty and ambiguity, collecting high-quality annotated data in a dynamic learning system can be extremely expensive. To mitigate this problem, we propose a novel weakly-supervised learning paradigm called Incremental Partial Label Learning (IPLL), where the sequentially arriving data relate to a set of candidate labels rather than the ground truth. Technically, we develop the Prototype-Guided Disambiguation and Replay Algorithm (PGDR), which leverages the class prototypes as a proxy to mitigate two intertwined challenges in IPLL, i.e., label ambiguity and catastrophic forgetting. To handle the former, PGDR encapsulates a momentum-based pseudo-labeling algorithm along with prototype-guided initialization, resulting in a balanced perception of classes. To alleviate forgetting, we develop a memory replay technique that collects well-disambiguated samples while maintaining representativeness and diversity. By jointly distilling knowledge from curated memory data, our framework exhibits a great disambiguation ability for samples of new tasks and achieves less forgetting of knowledge. Extensive experiments demonstrate that PGDR achieves superior
Chinese: 提出的增量部分标签学习(IPLL)范式及其原型引导消歧与回放算法(PGDR)通过利用类别原型和记忆回放技术,有效解决了序列学习中的标签歧义和灾难性遗忘问题,在实验中表现出优越性能。
English: The proposed Incremental Partial Label Learning (IPLL) paradigm and its Prototype-Guided Disambiguation and Replay Algorithm (PGDR) effectively address label ambiguity and catastrophic forgetting in sequential learning by leveraging class prototypes and memory replay, achieving superior performance in experiments.
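
The momentum-based pseudo-labeling idea can be sketched as follows: keep a soft label over classes, pick the candidate class whose prototype best matches the sample, and move the soft label a small step toward it. This is a generic partial-label update written under assumptions (dot-product similarity, the 0.9 momentum, and all names are hypothetical); the abstract does not give PGDR's actual update rule.

```python
from typing import List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def update_pseudo_label(
    current: Sequence[float],
    candidates: Set[int],
    embedding: Sequence[float],
    prototypes: Sequence[Sequence[float]],
    momentum: float = 0.9,
) -> List[float]:
    """Move the soft label toward the best-matching class among the candidates."""
    # Only candidate labels compete, so mass never leaks onto excluded classes.
    best = max(sorted(candidates), key=lambda c: dot(embedding, prototypes[c]))
    return [
        momentum * p + (1.0 - momentum) * (1.0 if c == best else 0.0)
        for c, p in enumerate(current)
    ]


# Toy example: class 2 has the closest prototype but is not a candidate label.
prototypes = [[0.0, 1.0], [1.0, 0.0], [5.0, 0.0]]
soft_label = [0.5, 0.5, 0.0]
soft_label = update_pseudo_label(soft_label, {0, 1}, [1.0, 0.0], prototypes)
```

If the soft label starts as a distribution supported on the candidate set, it stays one after every update, which is what lets repeated updates gradually disambiguate the label.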

Authors:Xianrui Luo, Juewen Peng, Zhongang Cai, Lei Yang, Fan Yang, Zhiguo Cao, Guosheng Lin
Title: Deblur-Avatar: Animatable Avatars from Motion-Blurred Monocular Videos
Abstract:
We introduce a novel framework for modeling high-fidelity, animatable 3D human avatars from motion-blurred monocular video inputs. Motion blur is prevalent in real-world dynamic video capture, especially due to human movements in 3D human avatar modeling. Existing methods either (1) assume sharp image inputs, failing to address the detail loss introduced by motion blur, or (2) mainly consider blur by camera movements, neglecting the human motion blur which is more common in animatable avatars. Our proposed approach integrates a human movement-based motion blur model into 3D Gaussian Splatting (3DGS). By explicitly modeling human motion trajectories during exposure time, we jointly optimize the trajectories and 3D Gaussians to reconstruct sharp, high-quality human avatars. We employ a pose-dependent fusion mechanism to distinguish moving body regions, optimizing both blurred and sharp areas effectively. Extensive experiments on synthetic and real-world datasets demonstrate that our method significantly outperforms existing methods in rendering quality and quantitative metrics, producing sharp avatar reconstructions and enabling real-time rendering under challenging motion blur conditions.
中文: 本文提出了一种新颖框架,将基于人体运动的模糊模型集成到3D高斯泼溅中,能够从运动模糊视频中重建清晰的动态3D虚拟形象,在渲染质量和实时性能上显著优于现有方法。
English: This paper presents a novel framework that integrates a human motion-based blur model into 3D Gaussian Splatting to reconstruct sharp, animatable 3D avatars from motion-blurred videos, significantly outperforming existing methods in rendering quality and real-time performance.

Authors:Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Title: Learning segmentation from point trajectories
Abstract:
We consider the problem of segmenting objects in videos based on their motion and no other forms of supervision. Prior work has often approached this problem by using the principle of common fate, namely the fact that the motion of points that belong to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is difficult to model -- any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, which shows the utility of long-term motion and the effectiveness of our formulation.
Chinese: 本研究提出一种利用长期点轨迹结合光流训练分割网络的方法,通过子空间聚类启发的损失函数将轨迹分组为低秩矩阵,从而在基于运动的无监督视频对象分割中提升了性能。
English: This work introduces a method for unsupervised video object segmentation by training a network using long-term point trajectories alongside optical flow, employing a subspace clustering-inspired loss to group trajectories into low-rank matrices for improved motion-based segmentation.

Authors:Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego
Title: Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
Abstract:
One of the most widely used methods to evaluate LLMs is the Multiple Choice Question (MCQ) test. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale as the results can be processed automatically. To help the LLM answer, a few examples called few shots can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option or to first provide the reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the confidence of the LLM in the response. In this paper, we study how the LLM confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of the evaluation of questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer, as the LLM predicts the answer based on the input question and the reasoning that supports the selection made. Therefore, LLM estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
中文: 大语言模型在回答问题前提供推理时,无论答案正确与否,都表现出更高的自信心,这揭示了使用其估计概率进行评估的内在局限性。
English: Large language models exhibit higher confidence in their answers when providing reasoning before responding, regardless of accuracy, revealing intrinsic limitations in using their estimated probabilities for evaluation.
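
One common way to read off the "LLM-estimated probability" of an MCQ answer is to take the model's logits for the option tokens and apply a softmax. The sketch below does exactly that on made-up numbers; restricting the normalization to the four option tokens, the name `option_confidence`, and the logit values are assumptions, not the paper's protocol.

```python
import math
from typing import Dict, Tuple


def option_confidence(option_logits: Dict[str, float]) -> Tuple[str, float]:
    """Return the selected option and its softmax probability among the options."""
    peak = max(option_logits.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(v - peak) for k, v in option_logits.items()}
    total = sum(exp_scores.values())
    selected = max(exp_scores, key=exp_scores.get)
    return selected, exp_scores[selected] / total


# Hypothetical logits for the answer token when the model answers directly.
direct_logits = {"A": 2.0, "B": 0.0, "C": 0.0, "D": 0.0}
choice, confidence = option_confidence(direct_logits)
```

Under chain of thought, the same computation would be run on the logits produced after the generated reasoning, so the reasoning text itself becomes part of the context that shapes the reported confidence.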

Authors:Beatrice Savoldi, Giuseppe Attanasio, Eleonora Cupin, Eleni Gkovedarou, Janiça Hackenbuchner, Anne Lauscher, Matteo Negri, Andrea Piergentili, Manjinder Thind, Luisa Bentivogli
Title: Mind the Inclusivity Gap: Multilingual Gender-Neutral Translation Evaluation with mGeNTE
Abstract:
Avoiding the propagation of undue (binary) gender inferences and default masculine language remains a key challenge towards inclusive multilingual technologies, particularly when translating into languages with extensive gendered morphology. Gender-neutral translation (GNT) represents a linguistic strategy towards fairer communication across languages. However, research on GNT is limited to a few resources and language pairs. To address this gap, we introduce mGeNTE, an expert-curated resource, and use it to conduct the first systematic multilingual evaluation of inclusive translation with state-of-the-art instruction-following language models (LMs). Experiments on en-es/de/it/el reveal that while models can recognize when neutrality is appropriate, they cannot consistently produce neutral translations, limiting their usability. To probe this behavior, we enrich our evaluation with interpretability analyses that identify task-relevant features and offer initial insights into the internal dynamics of LM-based GNT.
中文:包容性多语言技术在避免性别推断和默认男性化语言方面面临挑战,尤其在翻译到性别形态丰富的语言时,现有模型虽能识别中性翻译的适用场景,却无法稳定生成中性译文,即使借助mGeNTE等专家构建的资源实现了首次系统性评估。
English: This paper introduces mGeNTE, an expert-curated resource used for the first systematic multilingual evaluation of gender-neutral translation; experiments on en-es/de/it/el show that instruction-following language models can recognize when neutrality is appropriate but cannot consistently produce neutral translations.

Authors:Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu
Title: Compositional Generative Model of Unbounded 4D Cities
Abstract:
3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.
Chinese: CityDreamer4D是一种组合式生成模型,通过分离动态物体与静态场景并采用专门设计的神经场,有效解决了4D城市生成的挑战,在创建逼真无边界城市环境方面实现了最先进的性能。
English: CityDreamer4D is a compositional generative model that addresses the challenges of 4D city generation by separating dynamic objects from static scenes and using specialized neural fields, achieving state-of-the-art performance in creating realistic unbounded urban environments.

Authors:Wenxuan Zeng, Ye Dong, Jinjin Zhou, Junming Ma, Jin Tan, Runsheng Wang, Meng Li
Title: MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference
Abstract:
Private large language model (LLM) inference based on secure multi-party computation (MPC) offers cryptographically-secure protection for both user prompt and proprietary model weights. However, it suffers from large latency overhead especially for long input sequences. While key-value (KV) cache eviction algorithms have been proposed to reduce the computation and memory cost for plaintext inference, they are not designed for MPC and cannot benefit private inference easily. In this paper, we propose an accurate and MPC-friendly KV cache eviction framework, dubbed MPCache. MPCache is built on the observation that historical tokens in a long sequence may have different effects on the downstream decoding. Hence, MPCache combines a look-once static eviction algorithm to discard unimportant tokens and a query-aware dynamic selection algorithm to further select a small subset of tokens for attention computation. As existing dynamic selection algorithms incur too much latency, we propose a series of optimizations to drastically reduce the KV cache selection overhead, including MPC-friendly similarity approximation, hierarchical KV cache clustering, and cross-layer index sharing strategy. With extensive experiments, we demonstrate that MPCache consistently outperforms prior-art KV cache eviction baselines across different LLM generation tasks and achieves 1.8~2.01x and 3.39~8.37x decoding latency and communication reduction on different sequence lengths, respectively.
中文摘要:MPCache是一种创新的KV缓存淘汰框架,通过结合静态与动态令牌选择算法来优化私有大语言模型的推理效率,在保持精度的同时显著降低了延迟和通信开销。
English Summary: MPCache is a novel KV cache eviction framework that enhances private LLM inference by combining static and dynamic token selection algorithms, significantly reducing latency and communication overhead while maintaining accuracy.
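
The two-stage selection described in the abstract can be sketched in plaintext as follows. This is only an illustrative sketch, not the MPCache implementation: it runs without any MPC protocol, and the function names (`static_evict`, `dynamic_select`), the per-token importance scores, and the dot-product similarity are all assumptions made for the example.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def static_evict(importance: Sequence[float], keep: int) -> List[int]:
    """Look-once static eviction (hypothetical scoring): keep the `keep`
    tokens with the highest importance, returned in original order."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])


def dynamic_select(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    candidates: Sequence[int],
    top_k: int,
) -> List[int]:
    """Query-aware dynamic selection: among the surviving tokens, keep the
    `top_k` whose keys are most similar to the current query."""
    ranked = sorted(candidates, key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:top_k])


# Toy cache with five tokens and 2-dimensional keys (made-up values).
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0], [0.9, 0.1]]
importance = [0.30, 0.05, 0.25, 0.10, 0.30]

kept = static_evict(importance, keep=3)                      # stage 1, done once
selected = dynamic_select([1.0, 0.0], keys, kept, top_k=2)   # stage 2, per query
print(kept, selected)
```

Attention would then be computed only over the `selected` indices. The abstract's further optimizations (similarity approximation, hierarchical clustering, cross-layer index sharing) are not modeled here.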

Authors:Weitian Zhang, Yichao Yan, Sijing Wu, Manwen Liao, Xiaokang Yang
Title: Disentangled Clothed Avatar Generation with Layered Representation
Abstract:
Clothed avatar generation has wide applications in virtual and augmented reality, filmmaking, and more. Previous methods have achieved success in generating diverse digital avatars; however, generating avatars with disentangled components (e.g., body, hair, and clothes) has long been a challenge. In this paper, we propose LayerAvatar, the first feed-forward diffusion-based method for generating component-disentangled clothed avatars. To achieve this, we first propose a layered UV feature plane representation, where components are distributed in different layers of the Gaussian-based UV feature plane with corresponding semantic labels. This representation supports high-resolution and real-time rendering, as well as expressive animation including controllable gestures and facial expressions. Based on the well-designed representation, we train a single-stage diffusion model and introduce constraint terms to address the severe occlusion problem of the innermost human body layer. Extensive experiments demonstrate the impressive performance of our method in generating disentangled clothed avatars, and we further explore its applications in component transfer. The project page is available at: https://olivia23333.github.io/LayerAvatar/
中文: LayerAvatar首次提出基于前馈扩散的分层UV特征平面表示方法,通过单阶段扩散模型生成组件解耦的着装虚拟人,在解决严重遮挡问题的同时支持高分辨率实时渲染与富有表现力的动画。
English: LayerAvatar introduces a novel feed-forward diffusion-based approach that utilizes a layered UV feature plane representation to generate component-disentangled clothed avatars, enabling high-resolution rendering and expressive animations while addressing occlusion challenges.

Authors:Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda
Title: XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation
Abstract:
The adoption of Artificial Intelligence in medical imaging holds great promise, yet it remains hindered by challenges such as data scarcity, privacy concerns, and the need for robust multimodal integration. While recent advances in generative modeling have enabled high-quality synthetic data generation, existing approaches are often limited to unimodal, unidirectional synthesis and therefore lack the ability to jointly synthesize multiple modalities while preserving clinical consistency. To address this challenge, we introduce XGeM, a 6.77-billion-parameter multimodal generative model designed to support flexible, any-to-any synthesis between medical data modalities. XGeM constructs a shared latent space via contrastive learning and introduces a novel Multi-Prompt Training strategy, enabling conditioning on arbitrary subsets of input modalities. This design allows the model to adapt to heterogeneous clinical inputs and generate multiple outputs jointly, preserving both semantic and structural coherence. We extensively validate XGeM: first we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for multi-view Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we show how XGeM can support key medical data challenges such as anonymization, class imbalance, and data scarcity, underscoring its utility as a foundation model for medical data synthesis. Project page is at https://cosbidev.github.io/XGeM/.
中文: XGeM是一个67.7亿参数的多模态生成模型,通过构建共享潜空间实现医疗数据模态间的灵活双向合成,在保持临床一致性的同时有效解决了数据稀缺和隐私保护等关键医学挑战。
English: XGeM is a 6.77-billion-parameter multimodal generative model that enables flexible any-to-any synthesis between medical data modalities while preserving clinical consistency, addressing key challenges like data scarcity and privacy through validated performance and expert assessment.

Authors:Tudor Jianu, Shayan Doust, Mengyun Li, Baoru Huang, Tuong Do, Hoan Nguyen, Karl Bates, Tung D. Ta, Sebastiano Fichera, Pierre Berthet-Rayne, Anh Nguyen
Title: SplineFormer: An Explainable Transformer-Based Approach for Autonomous Endovascular Navigation
Abstract:
Endovascular navigation is a crucial aspect of minimally invasive procedures, where precise control of curvilinear instruments like guidewires is critical for successful interventions. A key challenge in this task is accurately predicting the evolving shape of the guidewire as it navigates through the vasculature, which presents complex deformations due to interactions with the vessel walls. Traditional segmentation methods often fail to provide accurate real-time shape predictions, limiting their effectiveness in highly dynamic environments. To address this, we propose SplineFormer, a new transformer-based architecture, designed specifically to predict the continuous, smooth shape of the guidewire in an explainable way. By leveraging the transformer's ability, our network effectively captures the intricate bending and twisting of the guidewire, representing it as a spline for greater accuracy and smoothness. We integrate our SplineFormer into an end-to-end robot navigation system by leveraging the condensed information. The experimental results demonstrate that our SplineFormer is able to perform endovascular navigation autonomously and achieves a 50% success rate when cannulating the brachiocephalic artery on the real robot.
中文:SplineFormer是一种基于Transformer的新架构,能通过样条曲线精确预测导丝形状,实现自主血管内导航,并在真实机器人动脉插管中达到50%的成功率。
English: SplineFormer, a transformer-based architecture, accurately predicts guidewire shapes as splines for autonomous endovascular navigation, achieving a 50% success rate in real robot artery cannulation.

Authors:Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, Hakim Hacid
Title: Visual question answering: from early developments to recent advances -- a survey
Abstract:
Visual Question Answering (VQA) is an evolving research field aimed at enabling machines to answer questions about visual content by integrating image and language processing techniques such as feature extraction, object detection, text embedding, natural language understanding, and language generation. With the growth of multimodal data research, VQA has gained significant attention due to its broad applications, including interactive educational tools, medical image diagnosis, customer service, entertainment, and social media captioning. Additionally, VQA plays a vital role in assisting visually impaired individuals by generating descriptive content from images. This survey introduces a taxonomy of VQA architectures, categorizing them based on design choices and key components to facilitate comparative analysis and evaluation. We review major VQA approaches, focusing on deep learning-based methods, and explore the emerging field of Large Visual Language Models (LVLMs) that have demonstrated success in multimodal tasks like VQA. The paper further examines available datasets and evaluation metrics essential for measuring VQA system performance, followed by an exploration of real-world VQA applications. Finally, we highlight ongoing challenges and future directions in VQA research, presenting open questions and potential areas for further development. This survey serves as a comprehensive resource for researchers and practitioners interested in the latest advancements and future
中文: 本综述全面介绍了视觉问答(VQA)领域,系统阐述了其架构分类、深度学习方法、数据集应用以及多模态人工智能研究中的未来挑战。
English: This survey provides a comprehensive overview of Visual Question Answering (VQA), detailing its architectures, deep learning methods, datasets, applications, and future challenges in multimodal AI research.

Authors:Wali Ullah Khan, Eva Lagunas, Symeon Chatzinotas
Title: Transmissive Beyond Diagonal RIS-Mounted LEO Communication for NOMA IoT Networks
Abstract:
Reconfigurable Intelligent Surface (RIS) technology has emerged as a transformative solution for enhancing satellite networks in next-generation wireless communication. The integration of RIS in satellite networks addresses critical challenges such as limited spectrum resources and high path loss, making it an ideal candidate for next-generation Internet of Things (IoT) networks. This paper provides a new framework based on transmissive beyond diagonal RIS (T-BD-RIS) mounted low earth orbit (LEO) satellite networks with non-orthogonal multiple access (NOMA). The NOMA power allocation at LEO and phase shift design at T-BD-RIS are optimized to maximize the system's spectral efficiency. The optimization problem is formulated as non-convex, which is first transformed using successive convex approximation and then divided into two problems. A closed-form solution is obtained for LEO satellite transmit power using KKT conditions, and a semi-definite relaxation approach is adopted for the T-BD-RIS phase shift design. Numerical results are obtained based on Monte Carlo simulations, which demonstrate the advantages of T-BD-RIS in satellite networks.
中文: 本文提出了一种基于透射型超对角智能反射面(T-BD-RIS)的低轨卫星网络与非正交多址接入(NOMA)相结合的新框架,通过凸近似和半定松弛方法优化功率分配与相位偏移,有效提升了系统频谱效率。
English: This paper introduces a novel framework using transmissive beyond diagonal RIS (T-BD-RIS) in LEO satellite networks with NOMA, optimizing power allocation and phase shifts to maximize spectral efficiency through convex approximation and semi-definite relaxation methods.

Authors:Chao Wang, Licheng Jiao, Jiaxuan Zhao, Lingling Li, Fang Liu, Shuyuan Yang
Title: Learning Evolution via Optimization Knowledge Adaptation
Abstract:
Evolutionary algorithms (EAs) maintain populations through evolutionary operators to discover diverse solutions for complex tasks while gathering valuable knowledge, such as historical population data and fitness evaluations. However, traditional EAs face challenges in dynamically adapting to expanding knowledge bases, hindering the efficient exploitation of accumulated information and limiting adaptability to new situations. To address these issues, we introduce an Optimization Knowledge Adaptation Evolutionary Model (OKAEM), which features dynamic parameter adjustment using accumulated knowledge to enhance its optimization capabilities. OKAEM employs attention mechanisms to model the interactions among individuals, fitness landscapes, and genetic components separately, thereby parameterizing the evolutionary operators of selection, crossover, and mutation. These powerful learnable operators enable OKAEM to benefit from pre-learned extensive prior knowledge and self-tune with real-time evolutionary insights. Experimental results demonstrate that OKAEM: 1) exploits prior knowledge for significant performance gains across various knowledge transfer settings; 2) achieves competitive performance through self-tuning alone, even without prior knowledge; 3) outperforms state-of-the-art black-box baselines in a vision-language model tuning case; 4) can improve its optimization capabilities with growing knowledge; 5) is capable of emulating principles of natural selection and genetic recombination.
中文: 提出的优化知识适应进化模型(OKAEM)通过积累知识和注意力机制动态调整进化算子,在知识利用、自调优能力和多场景适应性方面展现出卓越性能。
English: The proposed Optimization Knowledge Adaptation Evolutionary Model (OKAEM) dynamically adjusts evolutionary operators using accumulated knowledge and attention mechanisms, demonstrating superior performance in knowledge exploitation, self-tuning capability, and adaptability across various optimization scenarios.

Authors:Jiayun Wang, Oleksii Ostras, Masashi Sode, Bahareh Tolooshams, Zongyi Li, Kamyar Azizzadenesheli, Gianmarco Pinton, Anima Anandkumar
Title: Ultrasound Lung Aeration Map via Physics-Aware Neural Operators
Abstract:
Lung ultrasound is a growing modality in clinics for diagnosing and monitoring acute and chronic lung diseases due to its low cost and accessibility. Lung ultrasound works by emitting diagnostic pulses, receiving pressure waves and converting them into radio frequency (RF) data, which are then processed into B-mode images with beamformers for radiologists to interpret. However, unlike conventional ultrasound for soft tissue anatomical imaging, lung ultrasound interpretation is complicated by complex reverberations from the pleural interface caused by the inability of ultrasound to penetrate air. The indirect B-mode images make interpretation highly dependent on reader expertise, requiring years of training, which limits its widespread use despite its potential for high accuracy in skilled hands. To address these challenges and democratize ultrasound lung imaging as a reliable diagnostic tool, we propose LUNA, an AI model that directly reconstructs lung aeration maps from RF data, bypassing the need for traditional beamformers and indirect interpretation of B-mode images. LUNA uses a Fourier neural operator, which processes RF data efficiently in Fourier space, enabling accurate reconstruction of lung aeration maps. LUNA offers a quantitative, reader-independent alternative to traditional semi-quantitative lung ultrasound scoring methods. The development of LUNA involves synthetic and real data: We simulate synthetic data with an experimentally validated approach and scan ex vivo swine lungs as real data. Trained on abundant simulated data and fine-tuned with a small amount of real-world data, LUNA achieves robust performance, demonstrated by an aeration estimation error of 9% in ex-vivo lung scans. We demonstrate the potential of reconstructing lung aeration maps from RF data, providing a foundation for improving lung ultrasound reproducibility and diagnostic utility.
中文:LUNA是一种人工智能模型,它通过傅里叶神经算子直接从射频数据重建肺部通气图,为传统B超解读提供了定量化、非依赖医师经验的替代方案,从而提升诊断准确性与普及性。
English: LUNA is an AI model that reconstructs lung aeration maps directly from radio frequency data using a Fourier neural operator, offering a quantitative and reader-independent alternative to traditional B-mode ultrasound interpretation to improve diagnostic accuracy and accessibility.

Authors:Zhen Tao, Shidong Pan, Zhenchang Xing, Xiaoyu Sun, Omar Haggag, John Grundy, Jingjie Li, Liming Zhu
Title: Privacy Bills of Materials: A Transparent Privacy Information Inventory for Collaborative Privacy Notice Generation in Mobile App Development
Abstract:
Privacy regulations mandate that developers must provide authentic and comprehensive privacy notices, e.g., privacy policies or labels, to inform users of their apps' privacy practices. However, due to a lack of knowledge of privacy requirements, developers often struggle to create accurate privacy notices, especially for sophisticated mobile apps with complex features and in crowded development teams. To address these challenges, we introduce Privacy Bills of Materials (PriBOM), a systematic software engineering approach that leverages different development team roles to better capture and coordinate mobile app privacy information. PriBOM facilitates transparency-centric privacy documentation and specific privacy notice creation, enabling traceability and trackability of privacy practices. We present a pre-fill of PriBOM based on static analysis and privacy notice analysis techniques. We demonstrate the perceived usefulness of PriBOM through a human evaluation with 150 diverse participants. Our findings suggest that PriBOM could serve as a significant solution for providing privacy support in DevOps for mobile apps.
中文摘要:PriBOM作为一种系统化软件工程方法,通过整合开发团队角色和实现隐私实践可追溯性,帮助开发者创建准确隐私声明,人类评估证实了其实际应用价值。
English Summary: Privacy Bills of Materials (PriBOM) is introduced as a systematic software engineering approach to help developers create accurate privacy notices by leveraging team roles and enabling traceability of privacy practices, with human evaluation demonstrating its perceived usefulness.

Authors:Zhaoyu Chen, Haijing Guo, Kaixun Jiang, Jiyuan Fu, Xinyu Zhou, Dingkang Yang, Hao Tang, Bo Li, Wenqiang Zhang
Title: Boosting Adversarial Transferability with Spatial Adversarial Alignment
Abstract:
Deep neural networks are vulnerable to adversarial examples that exhibit transferability across various models. Numerous approaches are proposed to enhance the transferability of adversarial examples, including advanced optimization, data augmentation, and model modifications. However, these methods still show limited transferability, particularly in cross-architecture scenarios, such as from CNN to ViT. To achieve high transferability, we propose a technique termed Spatial Adversarial Alignment (SAA), which employs an alignment loss and leverages a witness model to fine-tune the surrogate model. Specifically, SAA consists of two key parts: spatial-aware alignment and adversarial-aware alignment. First, we minimize the divergences of features between the two models in both global and local regions, facilitating spatial alignment. Second, we introduce a self-adversarial strategy that leverages adversarial examples to impose further constraints, aligning features from an adversarial perspective. Through this alignment, the surrogate model is trained to concentrate on the common features extracted by the witness model. This facilitates adversarial attacks on these shared features, thereby yielding perturbations that exhibit enhanced transferability. Extensive experiments on various architectures on ImageNet show that aligned surrogate models based on SAA can provide higher transferable adversarial examples, especially in cross-architecture attacks.
中文摘要:本文提出空间对抗对齐(SAA)技术,通过空间感知对齐和对抗感知对齐,使替代模型聚焦于见证模型提取的共性特征,从而显著提升跨架构对抗样本的迁移性。
English Summary: This paper introduces Spatial Adversarial Alignment (SAA), a technique that enhances adversarial example transferability across different model architectures by aligning features spatially and adversarially between surrogate and witness models.

Authors:Zeping Sui, Hien Quoc Ngo, Michail Matthaiou, Lajos Hanzo
Title: Performance Analysis and Optimization of STAR-RIS-Aided Cell-Free Massive MIMO Systems Relying on Imperfect Hardware
Abstract:
Simultaneously transmitting and reflecting reconfigurable intelligent surface (STAR-RIS)-aided cell-free massive multiple-input multiple-output (CF-mMIMO) systems are investigated under spatially correlated fading channels using realistic imperfect hardware. Specifically, the transceiver distortions, time-varying phase noise, and RIS phase shift errors are considered. Upon considering imperfect hardware and pilot contamination, we derive a linear minimum mean-square error (MMSE) criterion-based cascaded channel estimator. Moreover, a closed-form expression of the downlink ergodic spectral efficiency (SE) is derived based on maximum ratio (MR) based transmit precoding and channel statistics, where both a finite number of access points (APs) and STAR-RIS elements as well as imperfect hardware are considered. Furthermore, by exploiting the ergodic signal-to-interference-plus-noise ratios (SINRs) among user equipment (UE), a max-min fairness problem is formulated for the joint optimization of the passive transmitting and reflecting beamforming (BF) at the STAR-RIS as well as of the power control coefficients. An alternating optimization (AO) algorithm is proposed for solving the resultant problems, where iterative adaptive particle swarm optimization (APSO) and bisection methods are proposed for circumventing the non-convexity of the RIS passive BF and the quasi-concave power control sub-problems, respectively. Our simulation results illustrate that the STAR-RIS-aided CF-mMIMO system attains higher SE than its RIS-aided counterpart. The performance of different hardware parameters is also evaluated. Additionally, it is demonstrated that the SE of the worst UE can be significantly improved by exploiting the proposed AO-based algorithm compared to conventional solutions associated with random passive BF and equal-power scenarios.
中文摘要:本研究在考虑实际硬件缺陷条件下,通过开发信道估计和波束成形优化方法,证明了STAR-RIS辅助的无蜂窝大规模MIMO系统相比传统RIS系统具有更优越的频谱效率。
English Summary: This study investigates STAR-RIS-aided cell-free massive MIMO systems under realistic imperfect hardware conditions, developing channel estimation and beamforming optimization methods that demonstrate superior spectral efficiency compared to conventional RIS systems.

Authors:Marios Constantinides, Daniele Quercia
Title: AI, Jobs, and the Automation Trap: Where Is HCI?
Abstract:
As artificial intelligence (AI) continues to reshape the workforce, its current trajectory raises pressing questions about its ultimate purpose. Why does job automation dominate the agenda, even at the expense of human agency and equity? This paper critiques the automation-centric paradigm, arguing that current reward structures, which largely focus on cost reduction, drive the overwhelming emphasis on task replacement in AI patents. Meanwhile, Human-Centered AI (HCAI), which envisions AI as a collaborator augmenting human capabilities and aligning with societal values, remains a fugitive from the mainstream narrative. Despite its promise, HCAI has gone "missing", with little evidence of its principles translating into patents or real-world impact. To increase impact, actionable interventions are needed to disrupt existing incentive structures within the HCI community. We call for a shift in priorities to support translational research, foster cross-disciplinary collaboration, and promote metrics that reward tangible and real-world impact.
中文摘要:本文批判了当前以自动化为中心的AI发展模式,指出现有激励机制过度强调岗位替代而非人本主义路径,呼吁通过结构性改革推动增强人类能力并符合社会价值的AI发展。
English Summary: This paper critiques the automation-focused trajectory of AI development, arguing that current incentives prioritize job replacement over human-centered approaches, and calls for structural changes to promote AI that augments human capabilities and aligns with societal values.

Authors:Mingyu Derek Ma, Yanna Ding, Zijie Huang, Jianxi Gao, Yizhou Sun, Wei Wang
Title: Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection
Abstract:
Generative Language Models rely on autoregressive decoding to produce the output sequence token by token. Many tasks, such as preference optimization, require the model to produce task-level output consisting of multiple tokens directly by selecting candidates from a pool as predictions. Determining a task-level prediction from candidates using the ordinary token-level decoding mechanism is constrained by time-consuming decoding and by gradients that are interrupted by discrete token selection. Existing works use decoding-free candidate selection methods to obtain candidate probabilities from the initial output logits over the vocabulary. Though these estimation methods are widely used, they have not been systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches on a comprehensive set of tasks, including five multiple-choice QA tasks with a small candidate pool and four clinical decision tasks with a massive number of candidates, some with 10k+ options. We evaluate the estimation methods paired with a wide spectrum of foundation LMs covering different architectures, sizes and training paradigms. The results and insights from our analysis inform future model design.
Chinese: 本研究系统评估了多种免解码候选选择方法在不同任务和基础语言模型上的表现,为未来模型设计提供了重要见解。
English: This study systematically evaluates various decoding-free candidate selection methods across multiple tasks and foundation language models, providing insights for future model design.
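
To make the idea of decoding-free candidate selection concrete, here is a minimal sketch that scores multi-token candidates using only the logits of the first output step. The two estimators shown (first-token log-probability and mean log-probability over all candidate tokens) are illustrative assumptions and are not claimed to be the exact set evaluated in the paper; the toy logits, token ids, and function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def score_first_token(logp: Sequence[float], cand: Sequence[int]) -> float:
    """Use only the candidate's first token (assumed estimator)."""
    return logp[cand[0]]


def score_mean_tokens(logp: Sequence[float], cand: Sequence[int]) -> float:
    """Average the first-step log-probabilities of all candidate tokens
    (assumed estimator)."""
    return sum(logp[t] for t in cand) / len(cand)


def select(
    logits: Sequence[float],
    candidates: Dict[str, List[int]],
    scorer: Callable[[Sequence[float], Sequence[int]], float],
) -> str:
    """Pick the candidate with the highest score, without any decoding."""
    logp = log_softmax(logits)
    return max(candidates, key=lambda name: scorer(logp, candidates[name]))


# Toy first-step logits over a 4-token vocabulary (made-up values).
logits = [2.0, 0.5, 1.0, -1.0]
candidates = {"A": [0, 3], "B": [2, 1]}  # candidate name -> token ids

by_first = select(logits, candidates, score_first_token)
by_mean = select(logits, candidates, score_mean_tokens)
print(by_first, by_mean)
```

In this toy case the two estimators disagree, which is one reason a systematic comparison of such estimators is useful.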

Authors:Xinfa Zhu, Wenjie Tian, Xinsheng Wang, Lei He, Xi Wang, Sheng Zhao, Lei Xie
Title: CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions
Abstract:
Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality-aware audio generation. Additionally, we introduce a self-evolving training strategy that iteratively optimizes CosyAudio across both well-labeled and weakly-labeled datasets. Initially trained with well-labeled data, AudioCapTeller leverages its assessment capabilities on weakly-labeled datasets for high-quality filtering and reinforcement learning, which further improves its performance. The well-trained AudioCapTeller refines corpora by generating new captions and confidence scores, serving for the audio generator training. Extensive experiments on open-source datasets demonstrate that CosyAudio outperforms existing models in automated audio captioning, generates more faithful audio, and exhibits strong generalization across diverse scenarios.
中文: CosyAudio是一种新颖的文本到音频生成框架,通过置信度评分和合成字幕提升生成质量,其自我演进的训练策略在多种场景下均能生成更逼真的音频,性能优于现有模型。
English: CosyAudio is a novel framework that enhances text-to-audio generation by using confidence scores and synthetic captions, featuring a self-evolving training strategy that outperforms existing models in generating faithful audio across diverse scenarios.

Authors:Mude Hui, Rui-Jie Zhu, Songlin Yang, Yu Zhang, Zirui Wang, Yuyin Zhou, Jason Eshraghian, Cihang Xie
Title: ARFlow: Autoregressive Flow with Hybrid Linear Attention
Abstract:
Flow models are effective at progressively generating realistic images, but they generally struggle to capture long-range dependencies during the generation process as they compress all the information from previous time steps into a single corrupted image. To address this limitation, we propose integrating autoregressive modeling -- known for its excellence in modeling complex, high-dimensional joint probability distributions -- into flow models. During training, at each step, we construct causally-ordered sequences by sampling multiple images from the same semantic category and applying different levels of noise, where images with higher noise levels serve as causal predecessors to those with lower noise levels. This design enables the model to learn broader category-level variations while maintaining proper causal relationships in the flow process. During generation, the model autoregressively conditions on the previously generated images from earlier denoising steps, forming a contextual and coherent generation trajectory. Additionally, we design a customized hybrid linear attention mechanism tailored to our modeling approach to enhance computational efficiency. Our approach, termed ARFlow, achieves an FID of 6.63 on ImageNet at 256×256 without classifier-free guidance and reaches 1.96 FID with classifier-free guidance 1.5, outperforming the previous flow-based model SiT's 2.06 FID. Extensive ablation studies demonstrate the effectiveness of our modeling strategy and chunk-wise attention design.
中文: ARFlow模型通过训练时构建因果序列和生成时利用前序去噪图像,将自回归建模融入流模型,其定制注意力机制和建模策略在ImageNet上实现了领先的FID分数。
English: The proposed ARFlow model integrates autoregressive modeling into flow models by constructing causally-ordered sequences during training and conditioning on previous denoising steps during generation; with a hybrid linear attention mechanism, it reaches 1.96 FID on ImageNet 256×256 with classifier-free guidance, ahead of SiT's 2.06.
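
The training-time sequence construction described in the abstract can be sketched as below. This is a hedged illustration only: the linear interpolation `(1 - t) * x + t * noise`, the uniform sampling of noise levels, and the name `build_causal_sequence` are assumptions for the example, and images are represented as flat lists of floats rather than tensors.

```python
import random
from typing import List, Tuple


def build_causal_sequence(
    images: List[List[float]], rng: random.Random
) -> List[Tuple[float, List[float]]]:
    """Assign a noise level to each same-category image and order the
    sequence from most noisy to least noisy, so that noisier images act
    as causal predecessors of cleaner ones."""
    # One noise level per image, sorted in descending order.
    levels = sorted((rng.random() for _ in images), reverse=True)
    sequence = []
    for image, t in zip(images, levels):
        noise = [rng.gauss(0.0, 1.0) for _ in image]
        # Assumed corruption rule: linear blend of data and Gaussian noise.
        noisy = [(1.0 - t) * x + t * e for x, e in zip(image, noise)]
        sequence.append((t, noisy))
    return sequence


rng = random.Random(0)  # seeded for reproducibility
same_class_images = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
sequence = build_causal_sequence(same_class_images, rng)
levels = [t for t, _ in sequence]
print(levels)
```

A causal attention mask over this ordering would then let each image attend only to the noisier images that precede it.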

Authors:Wei Shi, Jiacheng Yao, Wei Xu, Jindan Xu, Xiaohu You, Yonina C. Eldar, Chunming Zhao
Title: Combating Interference for Over-the-Air Federated Learning: A Statistical Approach via RIS
Abstract:
Over-the-air computation (AirComp) integrates analog communication with task-oriented computation, serving as a key enabling technique for communication-efficient federated learning (FL) over wireless networks. However, owing to its analog characteristics, AirComp-enabled FL (AirFL) is vulnerable to both unintentional and intentional interference. In this paper, we aim to attain robustness in AirComp aggregation against interference via reconfigurable intelligent surface (RIS) technology to artificially reconstruct wireless environments. Concretely, we establish performance objectives tailored for interference suppression in wireless FL systems, aiming to achieve unbiased gradient estimation and reduce its mean square error (MSE). Oriented at these objectives, we introduce the concept of phase-manipulated favorable propagation and channel hardening for AirFL, which relies on the adjustment of RIS phase shifts to realize statistical interference elimination and reduce the error variance of gradient estimation. Building upon this concept, we propose two robust aggregation schemes of power control and RIS phase shifts design, both ensuring unbiased gradient estimation in the presence of interference. Theoretical analysis of the MSE and FL convergence affirms the anti-interference capability of the proposed schemes. It is observed that computation and interference errors diminish by an order of $\mathcal{O}\left(\frac{1}{N}\right)$ where $N$ is the number of RIS elements, and the ideal convergence rate without interference can be asymptotically achieved by increasing $N$. Numerical results confirm the analytical results and validate the superior performance of the proposed schemes over existing baselines.
中文: 本文提出利用可重构智能表面技术实现抗干扰的空中联邦学习聚合方案,通过优化相位偏移确保梯度估计无偏,并显著降低计算误差,其误差随表面单元数量增加而按阶次减小。
English: This paper introduces robust aggregation schemes using reconfigurable intelligent surface (RIS) technology to mitigate interference in over-the-air federated learning, achieving unbiased gradient estimation with computation and interference errors that diminish as $\mathcal{O}(1/N)$ in the number $N$ of RIS elements.

Authors:Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau
Title: Fuzzy-aware Loss for Source-free Domain Adaptation in Visual Emotion Recognition
Abstract:
Source-free domain adaptation in visual emotion recognition (SFDA-VER) is a highly challenging task that requires adapting VER models to the target domain without relying on source data, which is of great significance for data privacy protection. However, due to the unignorable disparities between visual emotion data and traditional image classification data, existing SFDA methods perform poorly on this task. In this paper, we investigate the SFDA-VER task from a fuzzy perspective and identify two key issues: fuzzy emotion labels and fuzzy pseudo-labels. These issues arise from the inherent uncertainty of emotion annotations and the potential mispredictions in pseudo-labels. To address these issues, we propose a novel fuzzy-aware loss (FAL) to enable the VER model to better learn and adapt to new domains under fuzzy labels. Specifically, FAL modifies the standard cross entropy loss and focuses on adjusting the losses of non-predicted categories, which prevents a large number of uncertain or incorrect predictions from overwhelming the VER model during adaptation. In addition, we provide a theoretical analysis of FAL and prove its robustness in handling the noise in generated pseudo-labels. Extensive experiments on 26 domain adaptation sub-tasks across three benchmark datasets demonstrate the effectiveness of our method.
中文: 本文提出了一种模糊感知损失(FAL),通过有效处理视觉情感识别中的模糊情感标签和伪标签,解决了无源域自适应任务中的关键问题,显著提升了模型的适应能力和鲁棒性。
English: This paper introduces a fuzzy-aware loss (FAL) to tackle the challenges of source-free domain adaptation in visual emotion recognition by effectively managing fuzzy emotion and pseudo-labels, enhancing model adaptation and robustness across diverse datasets.

Authors:Ilya Orson Sandoval, Isaac Symes Thompson, Vasilios Mavroudis, Chris Hicks
Title: An Attentive Graph Agent for Topology-Adaptive Cyber Defence
Abstract:
As cyber threats grow increasingly sophisticated, reinforcement learning (RL) is emerging as a promising technique to create intelligent and adaptive cyber defense systems. However, most existing autonomous defensive agents have overlooked the inherent graph structure of computer networks subject to cyber attacks, potentially missing critical information and constraining their adaptability. To overcome these limitations, we developed a custom version of the Cyber Operations Research Gym (CybORG) environment, encoding network state as a directed graph with realistic low-level features. We employ a Graph Attention Network (GAT) architecture to process node, edge, and global features, and adapt its output to be compatible with policy gradient methods in RL. Our GAT-based approach offers key advantages over flattened alternatives: policies that demonstrate resilience to certain types of unexpected dynamic network topology changes, reasonable generalisation to networks of varying sizes within the same structural distribution, and interpretable defensive actions grounded in tangible network properties. We demonstrate that GAT defensive policies can be trained using our low-level directed graph observations, even when unexpected connections arise during simulation. Evaluations across networks of different sizes, but consistent subnetwork structure, show our policies achieve comparable performance to policies trained specifically for each network configuration. Our study contributes to the development of robust cyber defence systems that can better adapt to real-world network security challenges.
中文摘要:基于图注意力网络的强化学习网络防御系统能够利用网络固有图结构,更好地适应动态拓扑变化并在不同规模网络中实现泛化,从而克服传统扁平化方法的局限性。
English Summary: Reinforcement learning-based cyber defense systems using Graph Attention Networks can better adapt to dynamic network topologies and generalize across different network sizes by leveraging inherent graph structures, overcoming limitations of traditional flattened approaches.

Authors:Shahin Hakemi, Naveed Akhtar, Ghulam Mubashar Hassan, Ajmal Mian
Title: Post-hoc Spurious Correlation Neutralization with Single-Weight Fictitious Class Unlearning
Abstract:
Neural network training tends to exploit the simplest features as shortcuts to greedily minimize training loss. However, some of these features might be spuriously correlated with the target labels, leading to incorrect predictions by the model. Several methods have been proposed to address this issue. Focusing on suppressing the spurious correlations with model training, they not only incur additional training cost, but also have limited practical utility as the model misbehavior due to spurious relations is usually discovered after its deployment. It is also often overlooked that spuriousness is a subjective notion. Hence, the precise questions that must be investigated are: to what degree a feature is spurious, and how we can proportionally distract the model's attention from it for reliable prediction. To this end, we propose a method that enables post-hoc neutralization of spurious feature impact, controllable to an arbitrary degree. We conceptualize spurious features as fictitious sub-classes within the original classes, which can be eliminated by a class removal scheme. We then propose a unique precise class removal technique that employs a single-weight modification, which entails negligible performance compromise for the remaining classes. We perform extensive experiments, demonstrating that by editing just a single weight in a post-hoc manner, our method achieves highly competitive, or better performance against the state-of-the-art methods.
中文: 神经网络训练常利用虚假特征走捷径,但所提方法通过仅修改单一权重,可在事后按任意程度中和这些特征的影响,同时保持性能损失极小。
English: Neural networks often rely on spurious features for shortcuts during training, but the proposed method enables post-hoc neutralization of these features' impact with minimal performance compromise by editing just a single weight.

Authors:Shahin Hakemi, Naveed Akhtar, Ghulam Mubashar Hassan, Ajmal Mian
Title: Single-weight Model Editing for Post-hoc Spurious Correlation Neutralization
Abstract:
Neural network training tends to exploit the simplest features as shortcuts to greedily minimize training loss. However, some of these features might be spuriously correlated with the target labels, leading to incorrect predictions by the model. Several methods have been proposed to address this issue. Focusing on suppressing the spurious correlations with model training, they not only incur additional training cost, but also have limited practical utility as the model misbehavior due to spurious relations is usually discovered after its deployment. It is also often overlooked that spuriousness is a subjective notion. Hence, the precise questions that must be investigated are: to what degree a feature is spurious, and how we can proportionally distract the model's attention from it for reliable prediction. To this end, we propose a method that enables post-hoc neutralization of spurious feature impact, controllable to an arbitrary degree. We conceptualize spurious features as fictitious sub-classes within the original classes, which can be eliminated by a class removal scheme. We then propose a unique precise class removal technique that makes a single-weight modification, which entails negligible performance compromise for the remaining classes. We perform extensive experiments, demonstrating that by editing just a single weight in a post-hoc manner, our method achieves highly competitive, or better performance against the state-of-the-art methods.
中文: 神经网络训练常利用虚假特征走捷径,但所提方法通过仅修改单一权重,可在事后按任意程度中和这些特征的影响,同时保持性能损失极小。
English: Neural networks often rely on spurious features for shortcuts during training, but the proposed method enables post-hoc neutralization of these features' impact with minimal performance compromise by editing just a single weight.

Authors:Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu, Zhennan Lin, Zhixian Zhao, Guojian Li, Wenjie Tian, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie
Title: OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Abstract:
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
Chinese: 本研究提出了OSUM这一开放语音理解模型,它采用ASR+X训练策略,在学术资源有限的情况下高效支持多种语音任务,并通过公开数据与训练方法促进透明性,助力学术界的创新研究。
English: This study introduces OSUM, an open speech understanding model that efficiently trains on academic-scale resources using an ASR+X strategy to handle multiple speech tasks while promoting transparency and accessibility for the research community.

Authors:Minghao Fu, Biwei Huang, Zijian Li, Yujia Zheng, Ignavier Ng, Yingyao Hu, Kun Zhang
Title: Identification of Nonparametric Dynamic Causal Structure and Latent Process in Climate System
Abstract:
The study of learning causal structure with latent variables has advanced the understanding of the world by uncovering causal relationships and latent factors, e.g., Causal Representation Learning (CRL). However, in real-world scenarios, such as those in climate systems, causal relationships are often nonparametric, dynamic, and exist among both observed variables and latent variables. These challenges motivate us to consider a general setting in which causal relations are nonparametric and unrestricted in their occurrence, which is unconventional to current methods. To solve this problem, with the aid of 3-measurement in temporal structure, we theoretically show that both latent variables and processes can be identified up to minor indeterminacy under mild assumptions. Moreover, we tackle the general nonlinear Causal Discovery (CD) from observations, e.g., temperature, as a specific task of learning independent representation, through the principle of functional equivalence. Based on these insights, we develop an estimation approach simultaneously recovering both the observed causal structure and latent causal process in a nontrivial manner. Simulation studies validate the theoretical foundations and demonstrate the effectiveness of the proposed methodology. In the experiments involving climate data, this approach offers a powerful and in-depth understanding of the climate system.
中文: 本文提出一个统一框架,能够同时揭示观测变量间的因果关系与潜在驱动因素,建立了可识别性条件并开发了CaDRe模型,在气候分析中既实现了优越的预测性能,又生成了符合领域知识的可解释因果图。
English: This paper introduces a unified framework that simultaneously uncovers causal relations among observed variables and latent driving forces, establishing identifiability conditions and proposing the CaDRe model, which demonstrates competitive forecasting and interpretable causal graphs in climate analysis.

Authors:Xinglin Pan, Wenxiang Lin, Lin Zhang, Shaohuai Shi, Zhenheng Tang, Rui Wang, Bo Li, Xiaowen Chu
Title: FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
Abstract:
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42$\times$ speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18$\times$-1.22$\times$ on 1458 MoE layers and 1.19$\times$-3.01$\times$ on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.
中文摘要:FSMoE是一种灵活的稀疏激活专家混合模型训练系统,通过统一抽象、协同调度和自适应梯度划分优化任务调度,相比现有系统在多种模型上实现了显著的训练加速。
English Summary: FSMoE is a flexible training system that optimizes task scheduling for sparsely activated mixture-of-experts (MoE) models, achieving significant speed improvements over existing implementations through unified abstraction, co-scheduling, and adaptive gradient partitioning.

Authors:Gongxu Luo, Haoyue Dai, Boyang Sun, Loka Li, Biwei Huang, Petar Stojanov, Kun Zhang
Title: Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders
Abstract:
Gene Regulatory Network Inference (GRNI) aims to identify causal relationships among genes using gene expression data, providing insights into regulatory mechanisms. A significant yet often overlooked challenge is selection bias, a process where only cells meeting specific criteria, such as gene expression thresholds, survive or are observed, distorting the true joint distribution of genes and thus biasing GRNI results. Furthermore, gene expression is influenced by latent confounders, such as non-coding RNAs, which add complexity to GRNI. To address these challenges, we propose GISL (Gene Regulatory Network Inference in the presence of Selection bias and Latent confounders), a novel algorithm to infer true regulatory relationships in the presence of selection and confounding issues. Leveraging data obtained via multiple gene perturbation experiments, we show that the true regulatory relationships, as well as selection processes and latent confounders can be partially identified without strong parametric models and under mild graphical assumptions. Experimental results on both synthetic and real-world single-cell gene expression datasets demonstrate the superiority of GISL over existing methods.
Chinese: GISL算法通过解决选择偏差和潜在混杂因素,有效推断基因调控网络,在合成和真实单细胞基因表达数据集上均优于现有方法,且无需强参数模型。
English: The GISL algorithm effectively infers gene regulatory networks by addressing selection bias and latent confounders, outperforming existing methods on synthetic and real-world datasets without requiring strong parametric models.

Authors:Junteng Mao, Ziye Jia, Hanzhi Gu, Chenyu Shi, Haomin Shi, Lijun He, Qihui Wu
Title: Robust UAV Path Planning with Obstacle Avoidance for Emergency Rescue
Abstract:
Unmanned aerial vehicles (UAVs) are efficient tools for diverse tasks such as electronic reconnaissance, agricultural operations and disaster relief. In complex three-dimensional (3D) environments, path planning with obstacle avoidance for UAVs is a significant issue for security assurance. In this paper, we construct a comprehensive 3D scenario with obstacles and no-fly zones for dynamic UAV trajectories. Moreover, a novel artificial potential field algorithm coupled with simulated annealing (APF-SA) is proposed to tackle the robust path planning problem. APF-SA modifies the attractive and repulsive potential functions and leverages simulated annealing to escape local minima and converge to globally optimal solutions. Simulation results demonstrate the effectiveness of APF-SA, which enables efficient autonomous path planning for UAVs with obstacle avoidance.
中文: 本文提出了一种结合模拟退火的人工势场算法(APF-SA),通过改进势场函数和避免局部最优,实现了无人机在复杂三维环境中高效安全的自主路径规划与避障。
English: This paper introduces a hybrid APF-SA algorithm that enhances UAV path planning in complex 3D environments by optimizing potential functions and using simulated annealing to achieve global optimality with effective obstacle avoidance.
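Illustrative sketch (not the authors' implementation): the abstract does not give the modified potential functions, so the code below assumes the textbook quadratic attractive potential and a range-limited repulsive potential, and pairs them with a Metropolis-style simulated-annealing acceptance rule over random fixed-length steps. All names, gains, the spherical obstacle model, and the cooling schedule are hypothetical choices made for illustration only.

```python
import math
import random

def potential(p, goal, obstacles, k_att=1.0, k_rep=1.0, d0=1.5):
    """Textbook APF (assumed, not the paper's modified form):
    quadratic attraction to the goal plus repulsion active within range d0."""
    u = 0.5 * k_att * math.dist(p, goal) ** 2
    for center, radius in obstacles:
        # Distance to the surface of a spherical obstacle, clamped away from zero.
        d = max(math.dist(p, center) - radius, 1e-6)
        if d < d0:
            u += 0.5 * k_rep * (1.0 / d - 1.0 / d0) ** 2
    return u

def sa_accept(delta_u, temperature, rng):
    """Metropolis rule: always accept downhill moves, sometimes accept uphill ones."""
    return delta_u <= 0 or rng.random() < math.exp(-delta_u / temperature)

def plan(start, goal, obstacles, step=0.2, t0=1.0, cooling=0.995,
         max_steps=5000, seed=0):
    """Random-proposal search on the potential with an annealed acceptance rule."""
    rng = random.Random(seed)
    p, temperature = tuple(start), t0
    path = [p]
    for _ in range(max_steps):
        if math.dist(p, goal) < step:
            break  # close enough to the goal
        # Propose a step of length `step` in a random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in p]
        norm = math.sqrt(sum(c * c for c in direction)) or 1.0
        q = tuple(pi + step * c / norm for pi, c in zip(p, direction))
        delta_u = potential(q, goal, obstacles) - potential(p, goal, obstacles)
        if sa_accept(delta_u, temperature, rng):
            p = q
            path.append(p)
        temperature = max(temperature * cooling, 1e-3)  # geometric cooling
    return path

# Hypothetical 3D scenario: one spherical obstacle between start and goal.
start = (0.0, 0.0, 0.0)
goal = (5.0, 5.0, 5.0)
obstacles = [((2.5, 2.5, 2.5), 1.0)]
path = plan(start, goal, obstacles)
```

The occasional uphill acceptance is what lets the search leave a local minimum of the potential, where a pure gradient-following planner would stall; whether the goal is reached depends on the assumed step size and cooling schedule.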

Authors:Xuanjun Chen, Jiawei Du, Haibin Wu, Lin Zhang, I-Ming Lin, I-Hsiang Chiu, Wenze Ren, Yuan Tseng, Yu Tsao, Jyh-Shing Roger Jang, Hung-yi Lee
Title: CodecFake+: A Large-Scale Neural Audio Codec-Based Deepfake Speech Dataset
Abstract:
With the rapid advancement of neural audio codecs, codec-based speech generation (CoSG) systems have become highly powerful. Unfortunately, CoSG also enables the creation of highly realistic deepfake speech, making it easier to mimic an individual's voice and spread misinformation. We refer to this emerging deepfake speech generated by CoSG systems as CodecFake. Detecting such CodecFake is an urgent challenge, yet most existing systems primarily focus on detecting fake speech generated by traditional speech synthesis models. In this paper, we introduce CodecFake+, a large-scale dataset designed to advance CodecFake detection. To our knowledge, CodecFake+ is the largest dataset encompassing the most diverse range of codec architectures. The training set is generated through re-synthesis using 31 publicly available open-source codec models, while the evaluation set includes web-sourced data from 17 advanced CoSG models. We also propose a comprehensive taxonomy that categorizes codecs by their root components: vector quantizer, auxiliary objectives, and decoder types. Our proposed dataset and taxonomy enable detailed analysis at multiple levels to discern the key factors for successful CodecFake detection. At the individual codec level, we validate the effectiveness of using codec re-synthesized speech (CoRS) as training data for large-scale CodecFake detection. At the taxonomy level, we show that detection performance is strongest when the re-synthesis model incorporates disentanglement auxiliary objectives or a frequency-domain decoder. Furthermore, from the perspective of using all the CoRS training data, we show that our proposed taxonomy can be used to select better training data for improving detection performance. Overall, we envision that CodecFake+ will be a valuable resource for both general and fine-grained exploration to develop better anti-spoofing models against CodecFake.
中文摘要:本文提出了CodecFake+数据集,用于检测基于神经音频编解码器生成的深度伪造语音,并通过分类法揭示了分离辅助目标和频域解码器等关键因素对提升检测效果的重要性。
English Summary: The paper introduces CodecFake+, a comprehensive dataset for detecting deepfake speech generated by neural audio codecs, and proposes a taxonomy that identifies key factors like disentanglement objectives and decoder types to enhance detection performance.

Authors:Haojun Yu, Youcheng Li, Nan Zhang, Zihan Niu, Xuantong Gong, Yanwen Luo, Haotian Ye, Siyu He, Quanlin Wu, Wangyan Qin, Mengyuan Zhou, Jie Han, Jia Tao, Ziwei Zhao, Di Dai, Di He, Dong Wang, Binghui Tang, Ling Huo, James Zou, Qingli Zhu, Yong Wang, Liwei Wang
Title: A Foundational Generative Model for Breast Ultrasound Image Analysis
Abstract:
Foundational models have emerged as powerful tools for addressing various tasks in clinical settings. However, their potential for breast ultrasound analysis remains untapped. In this paper, we present BUSGen, the first foundational generative model specifically designed for breast ultrasound image analysis. Pretrained on over 3.5 million breast ultrasound images, BUSGen has acquired extensive knowledge of breast structures, pathological features, and clinical variations. With few-shot adaptation, BUSGen can generate repositories of realistic and informative task-specific data, facilitating the development of models for a wide range of downstream tasks. Extensive experiments highlight BUSGen's exceptional adaptability, significantly exceeding real-data-trained foundational models in breast cancer screening, diagnosis, and prognosis. In breast cancer early diagnosis, our approach outperformed all board-certified radiologists (n=9), achieving an average sensitivity improvement of 16.5% (P-value<0.0001). Additionally, we characterized the scaling effect of using generated data, which was as effective as the collected real-world data for training diagnostic models. Moreover, extensive experiments demonstrated that our approach improved the generalization ability of downstream models. Importantly, BUSGen protected patient privacy by enabling fully de-identified data sharing, advancing secure medical data utilization. An online demo of BUSGen is available at https://aibus.bio.
中文:BUSGen是首个专为乳腺超声分析设计的基础生成模型,通过预训练超过350万张图像生成特定任务数据,显著提升癌症筛查、诊断和预后能力,其表现优于放射科医生并有效保护患者隐私。
English: BUSGen is the first foundational generative model for breast ultrasound analysis, pretrained on over 3.5 million images to generate task-specific data that enhances cancer screening, diagnosis, and prognosis, outperforming radiologists and ensuring patient privacy.

Authors:Zixuan Chen, Jing Huo, Yangtao Chen, Yang Gao
Title: RoboHorizon: An LLM-Assisted Multi-View World Model for Long-Horizon Robotic Manipulation
Abstract:
Efficient control in long-horizon robotic manipulation is challenging due to complex representation and policy learning requirements. Model-based visual reinforcement learning (RL) has shown great potential in addressing these challenges but still faces notable limitations, particularly in handling sparse rewards and complex visual features in long-horizon environments. To address these limitations, we propose the Recognize-Sense-Plan-Act (RSPA) pipeline for long-horizon tasks and further introduce RoboHorizon, an LLM-assisted multi-view world model tailored for long-horizon robotic manipulation. In RoboHorizon, pre-trained LLMs generate dense reward structures for multi-stage sub-tasks based on task language instructions, enabling robots to better recognize long-horizon tasks. Keyframe discovery is then integrated into the multi-view masked autoencoder (MAE) architecture to enhance the robot's ability to sense critical task sequences, strengthening its multi-stage perception of long-horizon processes. Leveraging these dense rewards and multi-view representations, a robotic world model is constructed to efficiently plan long-horizon tasks, enabling the robot to reliably act through RL algorithms. Experiments on two representative benchmarks, RLBench and FurnitureBench, show that RoboHorizon outperforms state-of-the-art visual model-based RL methods, achieving a 23.35% improvement in task success rates on RLBench's 4 short-horizon tasks and a 29.23% improvement on 6 long-horizon tasks from RLBench and 3 furniture assembly tasks from FurnitureBench.
中文: 提出的RoboHorizon框架通过整合大语言模型生成的密集奖励和多视角视觉表征,在长周期机器人操作任务中显著提升了成功率,优于现有最优方法。
English: The proposed RoboHorizon framework, integrating LLM-generated dense rewards and multi-view visual representations, significantly enhances robotic manipulation in long-horizon tasks by improving task success rates over state-of-the-art methods.

Authors:Wei Tang, Jiawei Yu, Yuang Li, Yanqing Zhao, Weidong Zhang, Wei Feng, Min Zhang, Hao Yang
Title: Investigating Numerical Translation with Large Language Models
Abstract:
The inaccurate translation of numbers can lead to significant security issues, ranging from financial setbacks to medical inaccuracies. While large language models (LLMs) have made significant advancements in machine translation, their capacity for translating numbers has not been thoroughly explored. This study focuses on evaluating the reliability of LLM-based machine translation systems when handling numerical data. In order to systematically test the numerical translation capabilities of current open-source LLMs, we have constructed a numerical translation dataset between Chinese and English based on real business data, encompassing ten types of numerical translation. Experiments on the dataset indicate that errors in numerical translation are a common issue, with most open-source LLMs faltering when faced with our test scenarios. Especially when it comes to numerical types involving large units like "million", "billion", and "yi", even the latest llama3.1 8b model can have error rates as high as 20%. Finally, we introduce three potential strategies to mitigate the numerical mistranslations for large units.
中文: 本研究发现开源大语言模型在数字翻译中普遍存在错误,处理“百万”“十亿”等大单位时错误率高达20%,并提出了三种改进策略来减少此类误译。
English: This study reveals that numerical mistranslation is a prevalent issue in open-source large language models, with error rates reaching up to 20% for large units like "million" and "billion," and proposes three strategies to address these inaccuracies.

Authors:Ruibo Wang, Mustafa A. Kishk, Howard H. Yang, Mohamed-Slim Alouini
Title: Satellite-Terrestrial Routing or Inter-Satellite Routing? A Stochastic Geometry Perspective
Abstract:
The design and comparison of satellite-terrestrial routing (STR) and inter-satellite routing (ISR) in low Earth orbit satellite constellations is a widely discussed topic. The signal propagation distance under STR is generally longer than that under ISR, resulting in greater path loss. The global deployment of gateways introduces additional costs for STR. In contrast, transmissions under ISR rely on the energy of satellites, which could be more costly. Additionally, inter-satellite links (ISLs) require more complex communication protocol design, extra hardware support, and increased computational power. To maximize energy efficiency, we propose two optimal routing relay selection algorithms for ISR and STR, respectively. Furthermore, we derive the analytical expressions for the routing availability probability and energy efficiency, quantifying the performance of the algorithms. The analyses enable us to assess the performance of the proposed algorithms against existing methods through numerical results, compare the performance of STR and ISR, and provide useful insights for constellation design.
Chinese Summary: 本研究针对卫星-地面和星间通信系统提出了两种能效最优的路由算法,通过数学表达式和数值比较分析其性能,为星座设计提供实用指导。
English Summary: This study proposes two energy-efficient routing algorithms for satellite-terrestrial and inter-satellite communication systems, analyzing their performance through mathematical expressions and numerical comparisons to guide constellation design.

Authors:Aritra Banik, Fedor V. Fomin, Petr A. Golovach, Tanmay Inamdar, Satyabrata Jana, Saket Saurabh
Title: Multivariate Exploration of Metric Dilation
Abstract:
Let $G$ be a weighted graph embedded in a metric space $(M, d_M)$. The vertices of $G$ correspond to the points in $M$, with the weight of each edge $uv$ being the distance $d_M(u, v)$ between their respective points in $M$. The dilation (or stretch) of $G$ is defined as the minimum factor $t$ such that, for any pair of vertices $u, v$, the distance between $u$ and $v$ in $G$ (the weight of a shortest $(u, v)$-path) is at most $t \cdot d_M(u, v)$. We study Dilation $t$-Augmentation, where the objective is, given a metric $M$, a graph $G$, and numerical values $k$ and $t$, to determine whether $G$ can be transformed into a graph with dilation $t$ by adding at most $k$ edges. Our primary focus is on the scenario where the metric $M$ is the shortest path metric of an unweighted graph $\Gamma$. Even in this specific case, Dilation $t$-Augmentation remains computationally challenging. In particular, the problem is W[2]-hard parameterized by $k$ when $\Gamma$ is a complete graph, already for $t=2$. Our main contribution lies in providing new insights into the impact of combinations of various parameters on the computational complexity of the problem. We establish the following. -- The parameterized dichotomy of the problem with respect to dilation $t$, when the graph $G$ is sparse: Parameterized by $k$, the problem is FPT for graphs excluding a biclique $K_{d,d}$ as a subgraph for $t\leq 2$ and the problem is W[1]-hard for $t\geq 3$ even if $G$ is a forest consisting of disjoint stars. -- The problem is FPT parameterized by the combined parameter $k+t+\Delta$, where $\Delta$ is the maximum degree of the graph $G$ or $\Gamma$.
中文: 本研究探讨了膨胀t增强问题的计算复杂性,基于边添加数量k、膨胀因子t和图的度数等参数组合,建立了参数化复杂性二分法并证明了固定参数可解性结果。
English: This research investigates the computational complexity of the Dilation t-Augmentation problem, establishing parameterized complexity dichotomies and fixed-parameter tractability results based on combinations of parameters including edge additions k, dilation factor t, and graph degrees.
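Illustrative sketch (not from the paper): the definition of dilation in the abstract can be evaluated directly on a small instance. The code below assumes the metric space is the Euclidean plane with distinct points, and uses Floyd-Warshall for shortest paths; the function name and the example instance are hypothetical.

```python
import math
from itertools import combinations

def dilation(points, edges):
    """Dilation (stretch) of a graph embedded in the Euclidean plane.

    points: list of distinct (x, y) coordinates, one per vertex.
    edges:  list of (u, v) vertex-index pairs; each edge weighs d_M(u, v).
    Returns math.inf if the graph is disconnected.
    """
    n = len(points)
    d_m = [[math.dist(p, q) for q in points] for p in points]
    # Shortest-path distances in G via Floyd-Warshall.
    d_g = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in edges:
        d_g[u][v] = d_g[v][u] = d_m[u][v]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d_g[i][k] + d_g[k][j] < d_g[i][j]:
                    d_g[i][j] = d_g[i][k] + d_g[k][j]
    # Smallest t with d_G(u, v) <= t * d_M(u, v) for every pair of vertices.
    return max(d_g[u][v] / d_m[u][v] for u, v in combinations(range(n), 2))

# Corners of the unit square joined by a path along three of its sides.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
path_edges = [(0, 1), (1, 2), (2, 3)]
before = dilation(square, path_edges)            # pair (0, 3) is stretched 3x
after = dilation(square, path_edges + [(0, 3)])  # one added edge closes the cycle
```

In this toy instance a single added edge ($k = 1$) lowers the dilation from 3 to $\sqrt{2}$, which is the kind of yes/no question Dilation $t$-Augmentation asks for given $k$ and $t$.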

Authors:Xinfa Zhu, Lei He, Yujia Xiao, Xi Wang, Xu Tan, Sheng Zhao, Lei Xie
Title: ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training
Abstract:
Style voice conversion aims to transform the speaking style of source speech into a desired style while keeping the original speaker's identity. However, previous style voice conversion approaches primarily focus on well-defined domains such as emotional aspects, limiting their practical applications. In this study, we present ZSVC, a novel Zero-shot Style Voice Conversion approach that utilizes a speech codec and a latent diffusion model with a speech prompting mechanism to facilitate in-context learning for speaking style conversion. To disentangle speaking style and speaker timbre, we introduce an information bottleneck to filter speaking style in the source speech and employ Uncertainty Modeling Adaptive Instance Normalization (UMAdaIN) to perturb the speaker timbre in the style prompt. Moreover, we propose a novel adversarial training strategy to enhance in-context learning and improve style similarity. Experiments conducted on 44,000 hours of speech data demonstrate the superior performance of ZSVC in generating speech with diverse speaking styles in zero-shot scenarios.
中文: ZSVC是一种创新的零样本语音风格转换方法,它结合语音编解码器和潜在扩散模型,通过语音提示机制实现上下文学习,在保持说话人音色的同时转换语音风格,并在44,000小时语音数据上展现出卓越性能。
English: ZSVC is a novel zero-shot style voice conversion method that uses a speech codec and latent diffusion model with a speech prompting mechanism to enable in-context learning for converting speaking styles while preserving speaker identity, achieving superior performance on 44,000 hours of speech data.

Authors:Huimeng Wang, Xurong Xie, Mengzhe Geng, Shujie Hu, Haoning Xu, Youjun Chen, Zhaoqing Li, Jiajun Deng, Xunying Liu
Title: Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition
Abstract:
Extracted discrete tokens provide efficient and domain-adaptable speech features. Their application to disordered speech that exhibits articulation imprecision and large mismatch against normal voice remains unexplored. To improve their phonetic discrimination that is weakened during unsupervised K-means or vector quantization of continuous features, this paper proposes novel phone-purity guided (PPG) discrete tokens for dysarthric speech recognition. Phonetic label supervision is used to regularize maximum likelihood and reconstruction error costs used in standard K-means and VAE-VQ based discrete token extraction. Experiments conducted on the UASpeech corpus suggest that the proposed PPG discrete token features extracted from HuBERT consistently outperform hybrid TDNN and End-to-End (E2E) Conformer systems using non-PPG based K-means or VAE-VQ tokens across varying codebook sizes by statistically significant word error rate (WER) reductions up to 0.99\% and 1.77\% absolute (3.21\% and 4.82\% relative) respectively on the UASpeech test set of 16 dysarthric speakers. The lowest WER of 23.25\% was obtained by combining systems using different token features. Consistent improvements on the phone purity metric were also achieved. T-SNE visualization further demonstrates that sharper decision boundaries were produced between K-means/VAE-VQ clusters after introducing phone-purity guidance.
中文: 本文提出了一种基于音素纯度的离散标记方法,通过在有监督条件下规范特征提取过程来增强构音障碍语音的区分能力,相比传统方法显著降低了词错误率。
English: This paper introduces phone-purity guided discrete tokens to enhance phonetic discrimination in dysarthric speech recognition, achieving significant word error rate reductions over conventional methods through supervised regularization of feature extraction processes.

Authors:Haoning Xu, Zhaoqing Li, Zengrui Jin, Huimeng Wang, Youjun Chen, Guinan Li, Mengzhe Geng, Shujie Hu, Jiajun Deng, Xunying Liu
Title: Effective and Efficient Mixed Precision Quantization of Speech Foundation Models
Abstract:
This paper presents a novel mixed-precision quantization approach for speech foundation models that tightly integrates mixed-precision learning and quantized model parameter estimation into one single model compression stage. Experiments conducted on the LibriSpeech dataset with fine-tuned wav2vec2.0-base and HuBERT-large models suggest the resulting mixed-precision quantized models increased the lossless compression ratio by factors up to 1.7x and 1.9x over the respective uniform-precision and two-stage mixed-precision quantized baselines that perform precision learning and model parameter quantization in separate and disjointed stages, while incurring no statistically significant word error rate (WER) increase over the 32-bit full-precision models. The system compression time of wav2vec2.0-base and HuBERT-large models is reduced by up to 1.9 and 1.5 times over the two-stage mixed-precision baselines, while both produce lower WERs. The best-performing 3.5-bit mixed-precision quantized HuBERT-large model produces a lossless compression ratio of 8.6x over the 32-bit full-precision system.
本文提出了一种统一的混合精度量化方法,将精度学习与参数估计同步优化,在保持语音识别准确率的同时实现了更高压缩比和更快处理速度。
This paper introduces a unified mixed-precision quantization method that simultaneously optimizes precision learning and parameter estimation, achieving higher compression ratios and faster processing without compromising speech recognition accuracy.

Authors:Prashant Trivedi, Souradip Chakraborty, Avinash Reddy, Vaneet Aggarwal, Amrit Singh Bedi, George K. Atia
Title: Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment
Abstract:
The alignment of large language models (LLMs) with human values is critical as these models become increasingly integrated into various societal and decision-making processes. Traditional methods, such as reinforcement learning from human feedback (RLHF), achieve alignment by fine-tuning model parameters, but these approaches are often computationally expensive and impractical when models are frozen or inaccessible for parameter modification. In contrast, prompt optimization is a viable alternative to RLHF for LLM alignment. While the existing literature has shown empirical promise of prompt optimization, its theoretical underpinning remains under-explored. We address this gap by formulating prompt optimization as an optimization problem and try to provide theoretical insights into the optimality of such a framework. To analyze the performance of the prompt optimization, we study theoretical suboptimality bounds and provide insights in terms of how prompt optimization depends upon the given prompter and target model. We also provide empirical validation through experiments on various datasets, demonstrating that prompt optimization can effectively align LLMs, even when parameter fine-tuning is not feasible.
中文摘要:提示优化作为一种替代人类反馈强化学习的方法,能有效对齐大语言模型与人类价值观,本研究不仅为其建立了理论基础,还通过实验验证了其实际可行性。
English Summary: Prompt optimization offers a computationally efficient alternative to reinforcement learning from human feedback for aligning large language models with human values, with this study providing both theoretical foundations and empirical validation for its effectiveness.

Authors:Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai
Title: STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Abstract:
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate that STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
Chinese Summary: 针对图像扩散模型在视频超分辨率中存在的时序一致性不足和伪影问题,本研究提出一种结合文本到视频模型的新方法,通过局部信息增强模块和动态频率损失来提升空间细节与保真度,实现更优的时空质量恢复效果。
English Summary: Image diffusion models adapted for video super-resolution face challenges in temporal consistency and artifact handling, prompting the development of a novel approach that integrates text-to-video models with specialized modules to enhance spatial details and fidelity while maintaining robust temporal coherence.

Authors:Ao Gao, Luosong Guo, Tao Chen, Zhao Wang, Ying Tai, Jian Yang, Zhenyu Zhang
Title: EasySplat: View-Adaptive Learning makes 3D Gaussian Splatting Easy
Abstract:
3D Gaussian Splatting (3DGS) techniques have achieved satisfactory 3D scene representation. Despite their impressive performance, they face challenges due to the limitations of structure-from-motion (SfM) methods in acquiring accurate scene initialization and the inefficiency of the densification strategy. In this paper, we introduce a novel framework, EasySplat, to achieve high-quality 3DGS modeling. Instead of using SfM for scene initialization, we employ a novel method to release the power of large-scale pointmap approaches. Specifically, we propose an efficient grouping strategy based on view similarity, and use robust pointmap priors to obtain high-quality point clouds and camera poses for 3D scene initialization. After obtaining a reliable scene structure, we propose a novel densification approach that adaptively splits Gaussian primitives based on the average shape of neighboring Gaussian ellipsoids, utilizing a KNN scheme. In this way, the proposed method tackles the limitations of initialization and optimization, leading to efficient and accurate 3DGS modeling. Extensive experiments demonstrate that EasySplat outperforms the current state-of-the-art (SOTA) in handling novel view synthesis.
Chinese: EasySplat提出了一种新框架,通过大规模点云方法改进场景初始化,并采用自适应高斯基元分割优化致密化策略,从而在3D高斯溅射中实现了更优的新视角合成效果。
English: EasySplat introduces a novel framework that enhances 3D Gaussian Splatting by improving scene initialization with large-scale pointmap methods and optimizing densification through adaptive splitting of Gaussian primitives, achieving superior novel view synthesis performance.

Authors:Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Yang Feng
Title: Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation
Abstract:
Simultaneous generation models write generation results while reading streaming inputs, necessitating a policy-maker to determine the appropriate output timing. Existing simultaneous generation methods generally adopt the traditional encoder-decoder architecture and learn the generation and policy-making capabilities through complex dynamic programming techniques. Although LLMs excel at text generation, they face challenges in taking on the role of policy-makers through traditional training methods, limiting their exploration in simultaneous generation. To overcome these limitations, we propose a novel LLM-driven Simultaneous Generation (LSG) framework, which allows the off-the-shelf LLM to decide the generation timing and produce output concurrently. Specifically, LSG selects the generation policy that minimizes latency as the baseline policy. Referring to the baseline policy, LSG enables the LLM to devise an improved generation policy that better balances latency and generation quality, and writes generation results accordingly. Experiments on simultaneous translation and streaming automatic speech recognition tasks show that our method can achieve state-of-the-art performance utilizing the open-source LLMs and demonstrate practicality in real-world scenarios.
中文摘要:提出的LLM驱动同步生成(LSG)框架使现成大语言模型能够自主决定输出时机并同步生成文本,在即时翻译等任务中实现了延迟与生成质量的最佳平衡。
English Summary: The proposed LLM-driven Simultaneous Generation (LSG) framework enables off-the-shelf large language models to autonomously determine output timing while generating text, achieving optimal balance between latency and quality in real-time applications like simultaneous translation.

Authors:Haibo Tong, Enmeng Lu, Yinqian Sun, Zhengqiang Han, Chao Liu, Feifei Zhao, Yi Zeng
Title: Autonomous Alignment with Human Value on Altruism through Considerate Self-imagination and Theory of Mind
Abstract:
With the widespread application of Artificial Intelligence (AI) in human society, enabling AI to autonomously align with human values has become a pressing issue to ensure its sustainable development and benefit to humanity. One of the most important aspects of aligning with human values is the necessity for agents to autonomously make altruistic, safe, and ethical decisions, considering and caring for human well-being. Current AI pursues absolute superiority in certain tasks while remaining indifferent to the surrounding environment and other agents, which has led to numerous safety risks. Altruistic behavior in human society originates from humans' capacity for empathizing with others, known as Theory of Mind (ToM), combined with predictive imaginative interactions before taking action to produce thoughtful and altruistic behaviors. Inspired by this, we aim to endow agents with considerate self-imagination and ToM capabilities, driving them through implicit intrinsic motivations to autonomously align with human altruistic values. By integrating ToM within the imaginative space, agents keep an eye on the well-being of other agents in real time, proactively anticipate potential risks to themselves and others, and make thoughtful altruistic decisions that balance negative effects on the environment. The ancient Chinese story of Sima Guang Smashes the Vat, in which the young Sima Guang smashes a vat to save a child who had accidentally fallen into it, illustrates such moral behavior and is an excellent reference scenario for this paper. We design an experimental scenario similar to Sima Guang Smashes the Vat and its variants with different complexities, which reflects the trade-offs and comprehensive considerations between self-goals, altruistic rescue, and avoiding negative side effects.
中文摘要:本研究通过将心理理论和自我想象能力融入智能体,使其能够自主做出考虑他人福祉的利他决策,并以司马光砸缸救人的经典故事作为实验场景进行验证。
English Summary: This research proposes integrating Theory of Mind and self-imagination capabilities into AI agents to enable autonomous altruistic decision-making that considers human well-being, using the classical Chinese story of Sima Guang saving a child as an experimental scenario.

Authors:Junchen Ding, Jiahao Zhang, Yi Liu, Ziqi Ding, Gelei Deng, Yuekang Li
Title: TombRaider: Entering the Vault of History to Jailbreak Large Language Models
Abstract:
Warning: This paper contains content that may involve potentially harmful behaviours, discussed strictly for research purposes. Jailbreak attacks can hinder the safety of Large Language Model (LLM) applications, especially chatbots. Studying jailbreak techniques is an important AI red teaming task for improving the safety of these applications. In this paper, we introduce TombRaider, a novel jailbreak technique that exploits the ability to store, retrieve, and use historical knowledge of LLMs. TombRaider employs two agents, the inspector agent to extract relevant historical information and the attacker agent to generate adversarial prompts, enabling effective bypassing of safety filters. We intensively evaluated TombRaider on six popular models. Experimental results showed that TombRaider could outperform state-of-the-art jailbreak techniques, achieving nearly 100% attack success rates (ASRs) on bare models and maintaining over 55.4% ASR against defence mechanisms. Our findings highlight critical vulnerabilities in existing LLM safeguards, underscoring the need for more robust safety defences.
中文: TombRaider是一种新型越狱技术,通过双智能体利用大语言模型的历史知识,在无防护模型上实现近乎完美的攻击成功率,并在防御机制下保持超过55%的成功率,揭示了现有安全防护的关键漏洞。
English: TombRaider is a novel jailbreak technique using dual agents to exploit LLMs' historical knowledge, achieving near-perfect attack success on unprotected models and over 55% against defenses, revealing critical vulnerabilities in current safeguards.

Authors:Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason Weston, Jing Xu
Title: R.I.P.: Better Models by Survival of the Fittest Prompts
Abstract:
Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high variance and low quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP) can be used to filter prompts from existing training sets, or to make high quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, which is from 18th place to 6th overall in the leaderboard.
中文: 本文提出的拒绝指令偏好(RIP)方法通过分析响应方差和奖励差距来过滤低质量提示,从而提升模型性能,在多个基准测试中均取得显著改进。
English: This paper introduces the Rejecting Instruction Preferences (RIP) method, which enhances model performance by filtering low-quality prompts through analyzing response variance and reward gaps, achieving significant gains across multiple benchmarks.

Authors:Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, Ilia Kulikov
Title: Diverse Preference Optimization
Abstract:
Post-training of language models, either through reinforcement learning, preference optimization or supervised finetuning, tends to sharpen the output probability distribution and reduce the diversity of generated responses. This is particularly a problem for creative generative tasks where varied responses are desired. In this work we introduce Diverse Preference Optimization (DivPO), an optimization method which learns to generate much more diverse responses than standard pipelines, while maintaining the quality of the generations. In DivPO, preference pairs are selected by first considering a pool of responses, and a measure of diversity among them, and selecting chosen examples as being more rare but high quality, while rejected examples are more common, but low quality. DivPO results in generating 45.6% more diverse persona attributes, and a 74.6% increase in story diversity, while maintaining similar win rates as standard baselines. On general instruction following, DivPO results in a 46.2% increase in diversity, and a 2.4% winrate improvement compared to DPO.
English Summary: DivPO is a novel optimization method that significantly enhances response diversity in language models while preserving generation quality, achieving up to 74.6% increased diversity without compromising performance compared to standard approaches.

Authors:Didier Chételat, Joseph Cotnareanu, Rylee Thompson, Yingxue Zhang, Mark Coates
Title: InnerThoughts: Disentangling Representations and Predictions in Large Language Models
Abstract:
Large language models (LLMs) contain substantial factual knowledge which is commonly elicited by multiple-choice question-answering prompts. Internally, such models process the prompt through multiple transformer layers, building varying representations of the problem within their hidden states. Ultimately, however, only the hidden state corresponding to the final layer and token position is used to predict the answer label. In this work, we propose instead to learn a small separate neural network predictor module on a collection of training questions, that takes the hidden states from all the layers at the last temporal position as input and outputs predictions. In effect, such a framework disentangles the representational abilities of LLMs from their predictive abilities. On a collection of hard benchmarks, our method achieves considerable improvements in performance, sometimes comparable to supervised fine-tuning procedures, but at a fraction of the computational cost.
中文: 本研究提出了一种独立的神经网络预测器,通过利用大语言模型所有层的隐藏状态来增强事实知识提取能力,在多项困难基准测试中实现了显著性能提升,同时大幅降低了计算成本。
English: This study introduces a separate neural network predictor that utilizes hidden states from all layers of large language models to enhance factual knowledge extraction, achieving significant performance gains on challenging benchmarks with minimal computational expense.

Authors:Siqi Wang, Yuanze Hu, Xinwang Liu, Siwei Wang, Guangpu Wang, Chuanfu Xu, Jie Liu, Ping Chen
Title: "Stones from Other Hills can Polish Jade": Zero-shot Anomaly Image Synthesis via Cross-domain Anomaly Injection
Abstract:
Industrial image anomaly detection (IAD) is a pivotal topic with huge value. Due to the nature of anomalies, real anomalies in a specific modern industrial domain (i.e. domain-specific anomalies) are usually too rare to collect, which severely hinders IAD. Thus, zero-shot anomaly synthesis (ZSAS), which synthesizes pseudo anomaly images without any domain-specific anomaly, emerges as a vital technique for IAD. However, existing solutions are either unable to synthesize authentic pseudo anomalies, or require cumbersome training. Thus, we focus on ZSAS and propose a brand-new paradigm that can realize both authentic and training-free ZSAS. It is based on a chronically-ignored fact: Although domain-specific anomalies are rare, real anomalies from other domains (i.e. cross-domain anomalies) are actually abundant and directly applicable to ZSAS. Specifically, our new ZSAS paradigm makes three-fold contributions: First, we propose a novel method named Cross-domain Anomaly Injection (CAI), which directly exploits cross-domain anomalies to enable highly authentic ZSAS in a training-free manner. Second, to supply CAI with sufficient cross-domain anomalies, we build what is, to the best of our knowledge, the first Domain-agnostic Anomaly Dataset, which provides ZSAS with abundant real anomaly patterns. Third, we propose a CAI-guided Diffusion Mechanism, which further breaks the quantity limit of real anomalies and enables unlimited anomaly synthesis. Our head-to-head comparison with existing ZSAS solutions justifies our paradigm's superior performance for IAD and demonstrates it as an effective and pragmatic ZSAS solution.
中文: 本文提出一种新颖的零样本异常合成范式,通过跨域异常注入和扩散机制利用跨域异常,实现了无需训练的高真实性异常检测,性能卓越。
English: This paper introduces a novel zero-shot anomaly synthesis paradigm that leverages cross-domain anomalies through Cross-domain Anomaly Injection and a diffusion mechanism to achieve authentic, training-free anomaly detection with superior performance.

Authors:Alexander Bonora, Alessandro Traspadini, Marco Giordani, Michele Zorzi
Title: Performance Evaluation of Satellite-Based Data Offloading on Starlink Constellations
Abstract:
Vehicular Edge Computing (VEC) is a key research area in autonomous driving. As Intelligent Transportation Systems (ITSs) continue to expand, ground vehicles (GVs) face the challenge of handling huge amounts of sensor data to drive safely. Specifically, due to energy and capacity limitations, GVs will need to offload resource-hungry tasks to external (cloud) computing units for faster processing. In 6th generation (6G) wireless systems, the research community is exploring the concept of Non-Terrestrial Networks (NTNs), where satellites can serve as space edge computing nodes to aggregate, store, and process data from GVs. In this paper we propose new data offloading strategies between a cluster of GVs and satellites in the Low Earth Orbits (LEOs), to optimize the trade-off between coverage and end-to-end delay. For the accuracy of the simulations, we consider real data and orbits from the Starlink constellation, one of the most representative and popular examples of commercial satellite deployments for communication. Our results demonstrate that Starlink satellites can support real-time offloading under certain conditions that depend on the onboard computational capacity of the satellites, the frame rate of the sensors, and the number of GVs.
中文摘要:本文提出地面车辆与低轨卫星间的新型数据卸载策略,利用星链星座数据优化覆盖与延迟的平衡,证明在特定卫星算力和传感器条件下可实现实时传输。
English Summary: The paper proposes novel data offloading strategies between ground vehicles and Low Earth Orbit satellites, using Starlink constellation data to optimize coverage-delay trade-offs, demonstrating real-time feasibility under specific satellite capacity and sensor conditions.

Authors:Jiaxin Bai, Zihao Wang, Yukun Zhou, Hang Yin, Weizhi Fei, Qi Hu, Zheye Deng, Jiayang Cheng, Tianshi Zheng, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Lixin Fan, Binhang Yuan, Wei Wang, Lei Chen, Xiaofang Zhou, Yangqiu Song
Title: Top Ten Challenges Towards Agentic Neural Graph Databases
Abstract:
Graph databases (GDBs) like Neo4j and TigerGraph excel at handling interconnected data but lack advanced inference capabilities. Neural Graph Databases (NGDBs) address this by integrating Graph Neural Networks (GNNs) for predictive analysis and reasoning over incomplete or noisy data. However, NGDBs rely on predefined queries and lack autonomy and adaptability. This paper introduces Agentic Neural Graph Databases (Agentic NGDBs), which extend NGDBs with three core functionalities: autonomous query construction, neural query execution, and continuous learning. We identify ten key challenges in realizing Agentic NGDBs, including semantic unit representation, abductive reasoning, scalable query execution, and integration with foundation models like large language models (LLMs). By addressing these challenges, Agentic NGDBs can enable intelligent, self-improving systems for modern data-driven applications, paving the way for adaptable and autonomous data management solutions.
中文: 本文提出代理神经图数据库(Agentic NGDBs),通过集成自主查询构建、神经查询执行和持续学习三大核心功能来增强神经图数据库,解决了适应性不足和缺乏自主性的问题,并针对实现智能自进化数据管理系统提出了十大关键挑战。
English: This paper proposes Agentic Neural Graph Databases (Agentic NGDBs), which enhance Neural Graph Databases with autonomous query construction, neural query execution, and continuous learning to overcome limitations in adaptability and autonomy, addressing ten key challenges for intelligent, self-improving data management systems.

Authors:Yogya Gamage, Nadia Gonzalez Fernandez, Martin Monperrus, Benoit Baudry
Title: Software Bills of Materials in Maven Central
Abstract:
Software Bills of Materials (SBOMs) are essential to ensure the transparency and integrity of the software supply chain. There is a growing body of work that investigates the accuracy of SBOM generation tools and the challenges for producing complete SBOMs. Yet, there is little knowledge about how developers distribute SBOMs. In this work, we mine SBOMs from Maven Central to assess the extent to which developers publish SBOMs along with the artifacts. We develop our work on top of the Goblin framework, which consists of a Maven Central dependency graph and a Weaver that allows augmenting the dependency graph with additional data. For this study, we select a sample of 10% of release nodes from the Maven Central dependency graph and collected 14,071 SBOMs from 7,290 package releases. We then augment the Maven Central dependency graph with the collected SBOMs. We present our methodology to mine SBOMs, as well as novel insights about SBOM publication. Our dataset is the first set of SBOMs collected from a package registry. We make it available as a standalone dataset, which can be used for future research about SBOMs and package distribution.
中文: 本研究从Maven中央仓库挖掘软件物料清单(SBOM),通过收集14,071份SBOM并增强依赖图的方法评估开发者发布情况,揭示了SBOM发布新见解,提供了首个来自软件包注册表的SBOM数据集以供未来研究。
English: This study mines Software Bills of Materials (SBOMs) from Maven Central to evaluate developer publication practices, revealing novel insights through a methodology that collected 14,071 SBOMs and augmented the dependency graph, providing the first SBOM dataset from a package registry for future research.

Authors:Han Li, Shaohui Li, Wenrui Dai, Maida Cao, Nuowen Kan, Chenglin Li, Junni Zou, Hongkai Xiong
Title: On Disentangled Training for Nonlinear Transform in Learned Image Compression
Abstract:
Learned image compression (LIC) has demonstrated superior rate-distortion (R-D) performance compared to traditional codecs, but is challenged by training inefficiency that could incur more than two weeks to train a state-of-the-art model from scratch. Existing LIC methods overlook the slow convergence caused by compacting energy in learning nonlinear transforms. In this paper, we first reveal that such energy compaction consists of two components, i.e., feature decorrelation and uneven energy modulation. On such basis, we propose a linear auxiliary transform (AuxT) to disentangle energy compaction in training nonlinear transforms. The proposed AuxT obtains coarse approximation to achieve efficient energy compaction such that distribution fitting with the nonlinear transforms can be simplified to fine details. We then develop wavelet-based linear shortcuts (WLSs) for AuxT that leverages wavelet-based downsampling and orthogonal linear projection for feature decorrelation and subband-aware scaling for
中文摘要:学习型图像压缩虽性能优越但训练缓慢,本文通过提出线性辅助变换和小波线性捷径来解决能量压缩问题,有效加速模型收敛。
English summary: Learned image compression achieves superior performance but suffers from slow training due to energy compaction issues, which this paper addresses by proposing a linear auxiliary transform and wavelet-based shortcuts to accelerate convergence.

Authors:Yoshiki Masuyama, Gordon Wichern, François G. Germain, Christopher Ick, Jonathan Le Roux
Title: Retrieval-Augmented Neural Field for HRTF Upsampling and Personalization
Abstract:
Head-related transfer functions (HRTFs) with dense spatial grids are desired for immersive binaural audio generation, but their recording is time-consuming. Although HRTF spatial upsampling has shown remarkable progress with neural fields, spatial upsampling only from a few measured directions, e.g., 3 or 5 measurements, is still challenging. To tackle this problem, we propose a retrieval-augmented neural field (RANF). RANF retrieves a subject whose HRTFs are close to those of the target subject from a dataset. The HRTF of the retrieved subject at the desired direction is fed into the neural field in addition to the sound source direction itself. Furthermore, we present a neural network that can efficiently handle multiple retrieved subjects, inspired by a multi-channel processing technique called transform-average-concatenate. Our experiments confirm the benefits of RANF on the SONICOM dataset, and it is a key component in the winning solution of Task 2 of the listener acoustic personalization challenge 2024.
中文: 提出的检索增强神经场(RANF)通过从数据集中检索相似对象的HRTF数据,显著提升了稀疏测量的空间上采样效果,在SONICOM数据集和2024年听觉个性化挑战赛中验证了其优越性。
English: The proposed retrieval-augmented neural field (RANF) enhances HRTF spatial upsampling by incorporating similar subjects' data from a dataset, achieving superior performance validated through the SONICOM dataset and a 2024 challenge victory.

Authors:He Chang, Jie Wu, Zhulin Tao, Yunshan Ma, Xianglin Huang, Tat-Seng Chua
Title: Integrate Temporal Graph Learning into LLM-based Temporal Knowledge Graph Model
Abstract:
Temporal Knowledge Graph Forecasting (TKGF) aims to predict future events based on the observed events in history. Recently, Large Language Models (LLMs) have exhibited remarkable capabilities, generating significant research interest in their application for reasoning over temporal knowledge graphs (TKGs). Existing LLM-based methods have integrated retrieved historical facts or static graph representations into LLMs. Despite the notable performance of LLM-based methods, they are limited by the insufficient modeling of temporal patterns and ineffective cross-modal alignment between graph and language, hindering the ability of LLMs to fully grasp the temporal and structural information in TKGs. To tackle these issues, we propose a novel framework TGL-LLM to integrate temporal graph learning into LLM-based temporal knowledge graph model. Specifically, we introduce temporal graph learning to capture the temporal and relational patterns and obtain the historical graph embedding. Furthermore, we design a hybrid graph tokenization to sufficiently model the temporal patterns within LLMs. To achieve better alignment between graph and language, we employ a two-stage training paradigm to finetune LLMs on high-quality and diverse data, thereby resulting in better performance. Extensive experiments on three real-world datasets show that our approach outperforms a range of state-of-the-art (SOTA) methods.
中文摘要:TGL-LLM框架通过将时序图学习与大语言模型相结合,采用混合图标记化和两阶段训练策略,有效提升了时序模式建模与跨模态对齐能力,在真实数据集上实现了最先进的预测性能。
English Summary: The TGL-LLM framework enhances temporal knowledge graph forecasting by integrating temporal graph learning with large language models, improving temporal pattern capture and cross-modal alignment through hybrid tokenization and two-stage training, achieving superior performance on real-world datasets.

Authors:Wenxuan Li, Alan Yuille, Zongwei Zhou
Title: How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?
Abstract:
The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform that trained on PASCAL from scratch. While ImageNet pre-training has shown enormous success, it is formed in 2D, and the learned features are for classification tasks; when transferring to more diverse tasks, like 3D image segmentation, its performance is inevitably compromised due to the deviation from the original ImageNet context. A significant challenge lies in the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. Firstly, we construct AbdomenAtlas 1.1 that comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Secondly, we develop a suite of models that are pre-trained on our AbdomenAtlas 1.1 for transfer learning. Our preliminary analyses indicate that the model trained only with 21 CT volumes, 672 masks, and 40 GPU hours has a transfer learning ability similar to the model trained with 5,050 (unlabeled) CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models can further scale up with larger annotated datasets, achieving significantly better performance than preexisting pre-trained models, irrespective of their pre-training methodologies or data sources. We hope this study can facilitate collective efforts in constructing larger 3D medical datasets and more releases of supervised pre-trained models.
中文: 本研究提出了AbdomenAtlas 1.1这一大规模标注的3D CT数据集及预训练模型,显著提升了医学影像任务的迁移学习效果,其性能优于现有方法且具备可扩展性。
English: The study introduces AbdomenAtlas 1.1, a large annotated 3D CT dataset, and pre-trained models that significantly enhance transfer learning for medical imaging tasks, outperforming existing methods and enabling scalable performance improvements.

Authors:Li-Hsiang Shen, Jyun-Jhe Huang, Kai-Ten Feng, Lie-Liang Yang, Jen-Ming Wu
Title: Federated Deep Reinforcement Learning for Energy Efficient Multi-Functional RIS-Assisted Low-Earth Orbit Networks
Abstract:
In this paper, a novel network architecture that deploys the multi-functional reconfigurable intelligent surface (MF-RIS) in low-Earth orbit (LEO) is proposed. Unlike traditional RIS with only signal reflection capability, the MF-RIS can reflect, refract, and amplify signals, as well as harvest energy from wireless signals. Given the high energy demands in shadow regions where solar energy is unavailable, MF-RIS is deployed in LEO to enhance signal coverage and improve energy efficiency (EE). To this end, we formulate a long-term EE optimization problem by determining the optimal parameters for MF-RIS configurations, including amplification and phase-shifts, energy harvesting ratios, and LEO transmit beamforming. To address the complex non-convex and non-linear problem, a federated learning enhanced multi-agent deep deterministic policy gradient (FEMAD) scheme is designed. The multi-agent DDPG provides each agent with the optimal action policy from its interactions with the environment, whereas federated learning enables the exchange of hidden information among agents. In numerical results, we can observe significant EE improvements compared to the other benchmarks, including centralized deep reinforcement learning as well as distributed multi-agent deep deterministic policy gradient (DDPG). Additionally, the proposed LEO-MF-RIS architecture has demonstrated its effectiveness, achieving the highest EE performance compared to the scenarios of fixed/no energy harvesting in MF-RIS, traditional reflection-only RIS, and deployment without RISs/MF-RISs.
中文摘要:本文提出了一种部署于低地球轨道的多功能可重构智能表面新架构,通过联邦学习增强的多智能体深度强化学习方案优化能量效率,相比传统方案展现出显著性能优势。
English Summary: This paper introduces a novel LEO-deployed multi-functional reconfigurable intelligent surface (MF-RIS) architecture that enhances both signal coverage and energy efficiency through a federated learning-enhanced multi-agent deep reinforcement learning scheme, demonstrating superior performance over existing benchmarks.

Authors:Yen-Ting Lin, Di Jin, Tengyu Xu, Tianhao Wu, Sainbayar Sukhbaatar, Chen Zhu, Yun He, Yun-Nung Chen, Jason Weston, Yuandong Tian, Arash Rahnama, Sinong Wang, Hao Ma, Han Fang
Title: Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Abstract:
Large language models (LLMs) have recently demonstrated remarkable success in mathematical reasoning. Despite progress in methods like chain-of-thought prompting and self-consistency sampling, these advances often focus on final correctness without ensuring that the underlying reasoning process is coherent and reliable. This paper introduces Step-KTO, a training framework that combines process-level and outcome-level binary feedback to guide LLMs toward more trustworthy reasoning trajectories. By providing binary evaluations for both the intermediate reasoning steps and the final answer, Step-KTO encourages the model to adhere to logical progressions rather than relying on superficial shortcuts. Our experiments on challenging mathematical benchmarks show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps. For example, on the MATH-500 dataset, Step-KTO achieves a notable improvement in Pass@1 accuracy over strong baselines. These results highlight the promise of integrating stepwise process feedback into LLM training, paving the way toward more interpretable and dependable reasoning capabilities.
中文: 大型语言模型在数学推理方面展现出强大能力,但现有方法常忽视推理过程的可靠性;本文提出Step-KTO训练框架,通过对中间步骤和最终结果的双重二元反馈来提升推理连贯性和准确性,在数学基准测试中取得显著进步。
English: Large language models have shown strong mathematical reasoning abilities, but existing methods often overlook the reliability of the reasoning process; this paper introduces Step-KTO, a training framework that uses binary feedback on both intermediate steps and final outcomes to enhance reasoning coherence and accuracy, achieving significant improvements on mathematical benchmarks.

Authors:Rui Meng, Dayu Fan, Haixiao Gao, Yifan Yuan, Bizhu Wang, Xiaodong Xu, Mengying Sun, Chen Dong, Xiaofeng Tao, Ping Zhang, Dusit Niyato
Title: Secure Semantic Communication With Homomorphic Encryption
Abstract:
In recent years, Semantic Communication (SemCom), which aims to achieve efficient and reliable transmission of meaning between agents, has garnered significant attention from both academia and industry. To ensure the security of communication systems, encryption techniques are employed to safeguard confidentiality and integrity. However, traditional cryptography-based encryption algorithms encounter obstacles when applied to SemCom. Motivated by this, this paper explores the feasibility of applying homomorphic encryption to SemCom. Initially, we review the encryption algorithms utilized in mobile communication systems and analyze the challenges associated with their application to SemCom. Subsequently, we employ scale-invariant feature transform to demonstrate that semantic features can be preserved in homomorphic encrypted ciphertext. Based on this finding, we propose a task-oriented SemCom scheme secured through homomorphic encryption. We design the privacy preserved deep joint source-channel coding (JSCC) encoder and decoder, and the frequency of key updates can be adjusted according to service requirements without compromising transmission performance. Simulation results validate that, when compared to plaintext images, the proposed scheme can achieve almost the same classification accuracy performance when dealing with homomorphic ciphertext images. Furthermore, we provide potential future research directions for homomorphic encrypted SemCom.
中文摘要:本文研究同态加密在语义通信中的应用,提出一种安全的面向任务方案,能在加密数据中保持语义特征,且在不影响传输性能的前提下实现与明文几乎相同的分类准确率。
English Summary: This paper investigates the application of homomorphic encryption to semantic communication, proposing a secure, task-oriented scheme that maintains semantic features in encrypted data and achieves comparable classification accuracy to plaintext without compromising transmission performance.

Authors:Enes Karanfil, Nevrez Imamoglu, Erkut Erdem, Aykut Erdem
Title: A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features
Abstract:
Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
Chinese: Spectral LLaVA框架通过整合多光谱数据和视觉-语言对齐技术,显著提升了遥感场景理解能力,在RGB数据不足时仍能实现精确分类和详细描述。
English: The Spectral LLaVA framework integrates multispectral data with vision-language alignment to improve scene understanding in remote sensing, achieving enhanced classification and detailed descriptions especially where RGB data is insufficient.

Authors:Farshad Rostami Ghadi, Masoud Kaveh, Kai-Kit Wong, Diego Martin, Riku Jantti, Zheng Yan
Title: Physical Layer Security in FAS-aided Wireless Powered NOMA Systems
Abstract:
The rapid evolution of communication technologies and the emergence of sixth-generation (6G) networks have introduced unprecedented opportunities for ultra-reliable, low-latency, and energy-efficient communication. However, the integration of advanced technologies like non-orthogonal multiple access (NOMA) and wireless powered communication networks (WPCNs) brings significant challenges, particularly in terms of energy constraints and security vulnerabilities. Traditional antenna systems and orthogonal multiple access schemes struggle to meet the increasing demands for performance and security in such environments. To address this gap, this paper investigates the impact of emerging fluid antenna systems (FAS) on the performance of physical layer security (PLS) in WPCNs. Specifically, we consider a scenario in which a transmitter, powered by a power beacon via an energy link, transmits confidential messages to legitimate FAS-aided users over information links while an external eavesdropper attempts to decode the transmitted signals. Additionally, users leverage the NOMA scheme, where the far user may also act as an internal eavesdropper. For the proposed model, we first derive the distributions of the equivalent channels at each node and subsequently obtain compact expressions for the secrecy outage probability (SOP) and average secrecy capacity (ASC), using Gaussian quadrature methods. Our results reveal that incorporating the FAS for NOMA users, instead of traditional antenna systems (TAS), enhances the performance of the proposed secure WPCN.
Chinese: 在无线供能通信网络中,流体天线系统与非正交多址接入技术的结合,相比传统天线系统,显著提升了物理层安全性能,具体表现为改善了保密中断概率和平均保密容量。
English: The integration of fluid antenna systems (FAS) with non-orthogonal multiple access (NOMA) in wireless powered communication networks (WPCNs) significantly improves physical layer security by enhancing secrecy outage probability and average secrecy capacity compared to traditional antenna systems.

Authors:Jiajun Zhou, Wentao Fu, Hao Song, Shanqing Yu, Qi Xuan, Xiaoniu Yang
Title: Multi-view Correlation-aware Network Traffic Detection on Flow Hypergraph
Abstract:
As the Internet rapidly expands, the increasing complexity and diversity of network activities pose significant challenges to effective network governance and security regulation. Network traffic, which serves as a crucial data carrier of network activities, has become indispensable in this process. Network traffic detection aims to monitor, analyze, and evaluate the data flows transmitted across the network to ensure network security and optimize performance. However, existing network traffic detection methods generally suffer from several limitations: 1) a narrow focus on characterizing traffic features from a single perspective; 2) insufficient exploration of discriminative features for different traffic; 3) poor generalization to different traffic scenarios. To address these issues, we propose a multi-view correlation-aware framework named FlowID for network traffic detection. FlowID captures multi-view traffic features via temporal and interaction awareness, while a hypergraph encoder further explores higher-order relationships between flows. To overcome the challenges of data imbalance and label scarcity, we design a dual-contrastive proxy task, enhancing the framework's ability to differentiate between various traffic flows through traffic-to-traffic and group-to-group contrast. Extensive experiments on five real-world datasets demonstrate that FlowID significantly outperforms existing methods in accuracy, robustness, and generalization across diverse network scenarios, particularly in detecting malicious traffic.
中文:提出的FlowID框架通过时空感知和交互感知捕获多视角流量特征,并利用超图编码器挖掘流间高阶关系,有效提升了跨场景检测的准确性与泛化能力。
English: The proposed FlowID framework addresses limitations in network traffic detection by capturing multi-view features and higher-order relationships through temporal and interaction awareness, significantly improving accuracy and generalization across diverse scenarios.

Authors:Xiucheng Wang, Peilin Zheng, Nan Cheng
Title: Erasing Noise in Signal Detection with Diffusion Model: From Theory to Application
Abstract:
In this paper, a signal detection method based on the denoising diffusion model (DM) is proposed, which outperforms the maximum likelihood (ML) estimation method that has long been regarded as the optimal signal detection technique. Theoretically, a novel mathematical theory for intelligent signal detection based on stochastic differential equations (SDEs) is established in this paper, demonstrating the effectiveness of DM in reducing the additive white Gaussian noise in received signals. Moreover, a mathematical relationship between the signal-to-noise ratio (SNR) and the timestep in DM is established, revealing that for any given SNR, a corresponding optimal timestep can be identified. Furthermore, to address potential issues with out-of-distribution inputs in the DM, we employ a mathematical scaling technique that allows the trained DM to handle signal detection across a wide range of SNRs without any fine-tuning. Building on the above theoretical foundation, we propose a DM-based signal detection method with the diffusion transformer (DiT) serving as the backbone neural network; the computational complexity of this method is $\mathcal{O}(n^2)$. Simulation results demonstrate that, for BPSK and QAM modulation schemes, the DM-based method achieves a significantly lower symbol error rate (SER) compared to ML estimation, while maintaining a much lower computational complexity.
中文: 本文提出了一种基于去噪扩散模型的信号检测方法,通过随机微分方程建立数学理论框架,在BPSK和QAM调制方案中相比传统最大似然估计显著降低了误码率并保持较低计算复杂度。
English: This paper introduces a denoising diffusion model-based signal detection method that surpasses traditional maximum likelihood estimation by establishing a mathematical framework using stochastic differential equations, achieving lower symbol error rates and computational complexity for BPSK and QAM modulations.
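
The SNR-to-timestep correspondence mentioned in the abstract can be illustrated with a small sketch. It assumes a standard variance-preserving DDPM forward process with a linear beta schedule and a unit-power transmitted signal; these are assumptions made here for illustration, and the paper's own derivation and schedule may differ. Under those assumptions the forward marginal at step t has SNR equal to alpha_bar_t / (1 - alpha_bar_t), so a received signal with a given SNR matches the step where alpha_bar_t = SNR / (1 + SNR), after scaling the signal by sqrt(alpha_bar_t). The function names are hypothetical.

```python
import math
from typing import List


def alpha_bar_schedule(num_steps: int = 1000,
                       beta_start: float = 1e-4,
                       beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def timestep_for_snr(snr_db: float, alpha_bars: List[float]) -> int:
    """Index of the step whose forward-process SNR is closest to the given SNR."""
    snr = 10.0 ** (snr_db / 10.0)      # dB -> linear
    target = snr / (1.0 + snr)         # alpha_bar that yields this SNR
    return min(range(len(alpha_bars)),
               key=lambda t: abs(alpha_bars[t] - target))


def scale_received(y: List[float], alpha_bar_t: float) -> List[float]:
    """Scale y = x + noise so its statistics match the forward marginal at step t."""
    return [math.sqrt(alpha_bar_t) * v for v in y]


alpha_bars = alpha_bar_schedule()
t_high = timestep_for_snr(10.0, alpha_bars)   # cleaner signal -> earlier step
t_low = timestep_for_snr(0.0, alpha_bars)     # noisier signal -> later step
```

In this sketch a higher SNR maps to an earlier timestep, so fewer denoising steps are needed for cleaner received signals.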

Authors:Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang
Title: LongViTU: Instruction Tuning for Long-Form Video Understanding
Abstract:
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.). We also offer explicit timestamp annotations of relevant events for each QA pair. We have conducted extensive human studies on LongViTU, and the results confirm the quality of our dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a state-of-the-art open-source model (LongVU), a proprietary model (Gemini-1.5-Pro), and human annotators yield GPT-4 scores of 49.9, 52.3, and 81.0, respectively, underscoring the substantial difficulty presented by LongViTU questions. Performing supervised fine-tuning (SFT) of LongVU and LLaVA-Video on LongViTU data results in average performance gains of 2.5% and 3.7%, respectively, across a suite of long video understanding benchmarks (EgoSchema, VideoMME-Long, MLVU, LVBench).
中文: 本文介绍了LongViTU,一个通过分层树结构和自我修正机制生成的大规模长视频理解数据集,其问答对具有长时上下文和复杂推理特点,对现有模型构成显著挑战,人类与AI评估结果均证实了这一点。
English: This paper presents LongViTU, a large-scale dataset for long-form video understanding generated through a hierarchical tree structure and self-revision mechanisms, featuring QA pairs with extended context and complex reasoning that poses significant challenges to current models, as shown by human and AI evaluations.

Authors:Andrew Bond, Jui-Hsien Wang, Long Mai, Erkut Erdem, Aykut Erdem
Title: GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting
Abstract:
Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios.
中文: 本文提出了一种结合3D高斯抛洒和神经微分方程的动态视频表示方法,通过分层学习策略实现高效高质量渲染,在多种视频数据集上展现出优越的时间一致性和性能表现。
English: This paper introduces a neural video representation using 3D Gaussian splatting and Neural ODEs for smooth camera motion, enhanced by a hierarchical learning strategy to achieve efficient, high-quality rendering with strong temporal consistency across various video datasets.

Authors:Pedro R. A. S. Bassi, Mehmet Can Yavuz, Kang Wang, Xiaoxi Chen, Wenxuan Li, Sergio Decherchi, Andrea Cavalli, Yang Yang, Alan Yuille, Zongwei Zhou
Title: RadGPT: Constructing 3D Image-Text Tumor Datasets
Abstract:
Cancers identified in CT scans are usually accompanied by detailed radiology reports, but publicly available CT datasets often lack these essential reports. This absence limits their usefulness for developing accurate report generation AI. To address this gap, we present AbdomenAtlas 3.0, the first public, high-quality abdominal CT dataset with detailed, expert-reviewed radiology reports. All reports are paired with per-voxel masks and they describe liver, kidney and pancreatic tumors. AbdomenAtlas 3.0 has 9,262 triplets of CT, mask and report, 3,955 of them with tumors. These CT scans come from 17 public datasets. Besides creating the reports for these datasets, we expanded their number of tumor masks by 4.2x, identifying 3,011 new tumor cases. Notably, the reports in AbdomenAtlas 3.0 are more standardized, and generated faster than traditional human-made reports. They provide details like tumor size, location, attenuation and surgical resectability. These reports were created by 12 board-certified radiologists using our proposed RadGPT, a novel framework that converted radiologist-revised tumor segmentation masks into structured and narrative reports. Besides being a dataset creation tool, RadGPT can also become a fully-automatic, segmentation-assisted report generation method. We benchmarked this method and 5 state-of-the-art report generation vision-language models. Our results show that segmentation strongly improves tumor detection in AI-made reports.
中文摘要:AbdomenAtlas 3.0推出了首个配备专家审核影像报告和肿瘤掩模的公开腹部CT数据集,通过提供9,262组结构化数据三元组填补了AI开发的关键空白,其采用的RadGPT框架不仅能自动生成标准化报告,还显著提升了肿瘤检测的准确性。
English Summary: AbdomenAtlas 3.0 introduces the first public abdominal CT dataset with expert-reviewed radiology reports and tumor masks, addressing the critical gap in AI development by providing 9,262 structured data triplets enhanced through RadGPT, which also serves as an automated report generation tool proven to boost tumor detection accuracy.

Authors:Zhi-Lin Huang, Yixuan Liu, Chujun Qin, Zhongdao Wang, Dong Zhou, Dong Li, Emad Barsoum
Title: Edit as You See: Image-guided Video Editing via Masked Motion Modeling
Abstract:
Recent advancements in diffusion models have significantly facilitated text-guided video editing. However, there is a relative scarcity of research on image-guided video editing, a method that empowers users to edit videos by merely indicating a target object in the initial frame and providing an RGB image as reference, without relying on the text prompts. In this paper, we propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff for the image-guided video editing. IVEDiff is built on top of image editing models, and is equipped with learnable motion modules to maintain the temporal consistency of edited video. Inspired by self-supervised learning concepts, we introduce a masked motion modeling fine-tuning strategy that empowers the motion module's capabilities for capturing inter-frame motion dynamics, while preserving the capabilities for intra-frame semantic correlations modeling of the base image editing model. Moreover, an optical-flow-guided motion reference network is proposed to ensure the accurate propagation of information between edited video frames, alleviating the misleading effects of invalid information. We also construct a benchmark to facilitate further research. The comprehensive experiments demonstrate that our method is able to generate temporally smooth edited videos while robustly dealing with various editing objects with high quality.
中文摘要:本文提出IVEDiff,一种基于图像引导的视频编辑扩散模型,通过可学习运动模块和掩码运动建模策略,在不依赖文本提示的情况下确保视频时序一致性和高质量对象编辑效果。
English Summary: This paper introduces IVEDiff, a novel image-guided video editing diffusion model that uses learnable motion modules and a masked motion modeling strategy to ensure temporal consistency and high-quality object editing without relying on text prompts.

Authors:Yaodan Xu, Sheng Zhou, Zhisheng Niu
Title: SMDP-Based Dynamic Batching for Improving Responsiveness and Energy Efficiency of Batch Services
Abstract:
For servers incorporating parallel computing resources, batching is a pivotal technique for providing efficient and economical services at scale. Parallel computing resources exhibit heightened computational and energy efficiency when operating with larger batch sizes. However, in the realm of online services, the adoption of a larger batch size may lead to longer response times. This paper aims to provide a dynamic batching scheme that delicately balances latency and efficiency. The system is modeled as a batch service queue with size-dependent service times. Then, the design of dynamic batching is formulated as a semi-Markov decision process (SMDP) problem, with the objective of minimizing the weighted sum of average response time and average power consumption. A method is proposed to derive an approximate optimal SMDP solution, representing the chosen dynamic batching policy. By introducing an abstract cost to reflect the impact of "tail" states, the space complexity and the time complexity of the procedure can decrease by 63.5% and 98%, respectively. Numerical results showcase the superiority of SMDP-based batching policies across various parameter setups. Additionally, the proposed scheme exhibits noteworthy flexibility in balancing power consumption and latency.
中文摘要:本文提出了一种基于半马尔可夫决策过程的动态批处理方案,通过建立批量服务队列模型,在保证系统灵活性的同时有效平衡并行计算中的延迟与能效,显著降低了计算复杂度。
English Summary: This paper introduces a dynamic batching scheme using a semi-Markov decision process to optimize the trade-off between latency and efficiency in parallel computing systems, achieving significant complexity reductions while maintaining flexibility in balancing power consumption and response time.
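
The latency/efficiency tension that motivates the weighted-sum objective can be seen in a toy calculation. The sketch below is not the paper's SMDP formulation: it ignores arrivals and queueing delay, and the linear service-time and energy models, their coefficients, and all function names are hypothetical choices made only to show how the weights shift the preferred batch size.

```python
def service_time(batch_size: int, setup: float = 2.0, per_item: float = 0.5) -> float:
    """Size-dependent service time of one batch (hypothetical linear model)."""
    return setup + per_item * batch_size


def energy_per_request(batch_size: int, fixed: float = 10.0, per_item: float = 1.0) -> float:
    """Energy of one batch divided by its size; the fixed cost is amortized."""
    return (fixed + per_item * batch_size) / batch_size


def weighted_cost(batch_size: int, w_latency: float, w_energy: float) -> float:
    """Weighted sum of latency and per-request energy for a given batch size."""
    return (w_latency * service_time(batch_size)
            + w_energy * energy_per_request(batch_size))


def best_batch_size(max_batch: int, w_latency: float, w_energy: float) -> int:
    """Batch size in 1..max_batch that minimizes the weighted cost."""
    return min(range(1, max_batch + 1),
               key=lambda b: weighted_cost(b, w_latency, w_energy))


# Emphasizing energy favors large batches; emphasizing latency favors small ones.
energy_focused = best_batch_size(32, w_latency=1.0, w_energy=5.0)
latency_focused = best_batch_size(32, w_latency=5.0, w_energy=1.0)
```

A dynamic policy such as the one the abstract describes would make this choice per state (for example, per queue length) rather than fixing a single batch size in advance.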

Authors:Liang He, Yougang Chu, Zhen Wu, Jianbing Zhang, Xinyu Dai, Jiajun Chen
Title: Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark
Abstract:
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. However, biases within datasets can lead models to learn shortcut patterns, resulting in inaccurate assessments and hindering real-world applicability. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark DREB that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. DREB utilizes Bias Evaluator and PPL Evaluator to ensure low bias and high naturalness, providing a reliable and accurate assessment of model generalization in entity bias scenarios. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques. MixDebias effectively improves model performance on DREB while maintaining performance on the original dataset. Extensive experiments demonstrate the effectiveness and robustness of MixDebias compared to existing methods, highlighting its potential for improving the generalization ability of relation extraction models. We will release DREB and MixDebias publicly.
中文摘要:本文提出了DREB这一消除实体偏见的关系抽取基准,通过实体替换打破伪相关性,并开发了MixDebias方法,在保持原数据集性能的同时有效提升模型泛化能力。
English Summary: This paper introduces DREB, a debiased benchmark for relation extraction that mitigates entity bias through entity replacement, along with MixDebias, a novel method that enhances model generalization while maintaining original performance.

Authors:Yixing Xu, Shivank Nag, Dong Li, Lu Tian, Emad Barsoum
Title: MSWA: Refining Local Attention with Multi-Scale Window Attention
Abstract:
Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and linearly increased cache size. Sliding window attention (SWA) solves this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for each head in each layer, making it inefficient in capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA) which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases window size allocation from shallow to deep layers, thus enabling the model to capture contextual information with different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
中文: 提出的多尺度窗口注意力(MSWA)通过在不同头和层中采用多样化窗口大小,克服了滑动窗口注意力中统一窗口尺寸的局限,从而在语言任务中更高效地捕捉多尺度上下文信息。
English: The proposed Multi-Scale Window Attention (MSWA) overcomes the limitations of uniform window sizes in sliding window attention by employing diverse window sizes across heads and layers, enabling more effective and efficient capture of multi-scale contextual information in language tasks.
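
The idea of giving each head and layer its own window can be sketched with plain attention masks. The abstract does not state the actual window sizes, so the allocation rule below (a base window that grows with layer depth, multiplied by a per-head scale) is a hypothetical choice for illustration, as are the function names.

```python
from typing import List, Sequence


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Causal local mask: query i may attend to key j iff 0 <= i - j < window."""
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


def allocate_windows(base: int, num_layers: int,
                     head_scales: Sequence[int] = (1, 2, 4, 8)) -> List[List[int]]:
    """Hypothetical allocation: windows differ across heads and grow with depth."""
    return [[base * (layer + 1) * scale for scale in head_scales]
            for layer in range(num_layers)]


def multi_scale_masks(seq_len: int, head_windows: Sequence[int]) -> List[List[List[bool]]]:
    """One sliding-window mask per head, each with its own window size."""
    return [sliding_window_mask(seq_len, w) for w in head_windows]


windows = allocate_windows(base=2, num_layers=2)   # [[2, 4, 8, 16], [4, 8, 16, 32]]
masks = multi_scale_masks(seq_len=8, head_windows=windows[0])
# Number of keys visible to the last query in each head of the first layer.
visible = [sum(mask[-1]) for mask in masks]
```

With a uniform window every head would see the same span; here the heads of one layer cover spans of different lengths, and deeper layers receive larger windows.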

Authors:Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou, Yi-Hao Peng, Sanjay Subramanian, Qinyue Tan, Maarten Sap, Alane Suhr, Daniel Fried, Graham Neubig, Trevor Darrell
Title: AutoPresent: Designing Structured Visuals from Scratch
Abstract:
Designing structured visuals such as presentation slides is essential for communicative needs, necessitating both content creation and visual planning skills. In this work, we tackle the challenge of automated slide generation, where models produce slide presentations from natural language (NL) instructions. We first introduce the SlidesBench benchmark, the first benchmark for slide generation with 7k training and 585 testing examples derived from 310 slide decks across 10 domains. SlidesBench supports evaluations that are (i) reference-based to measure similarity to a target slide, and (ii) reference-free to measure the design quality of generated slides alone. We benchmark end-to-end image generation and program generation methods with a variety of models, and find that programmatic methods produce higher-quality slides in user-interactable formats. Built on the success of program generation, we create AutoPresent, an 8B Llama-based model trained on 7k pairs of instructions and code for slide generation, and achieve results comparable to the closed-source model GPT-4o. We further explore iterative design refinement where the model is tasked to self-refine its own output, and we find that this process improves the slide's quality. We hope that our work will provide a basis for future work on generating structured visuals.
中文: 本研究提出了首个基于自然语言指令的幻灯片生成基准SlidesBench,并开发了开源模型AutoPresent,通过程序化生成和迭代优化实现了与GPT-4o相当的性能。
English: This work introduces SlidesBench, the first benchmark for automated slide generation from natural language instructions, and presents AutoPresent, an open-source model that achieves performance comparable to GPT-4o through programmatic methods and iterative refinement.

Authors:Rui Meng, Song Gao, Dayu Fan, Haixiao Gao, Yining Wang, Xiaodong Xu, Bizhu Wang, Suyu Lv, Zhidi Zhang, Mengying Sun, Shujun Han, Chen Dong, Xiaofeng Tao, Ping Zhang
Title: A Survey of Secure Semantic Communications
Abstract:
Semantic communication (SemCom) is regarded as a promising and revolutionary technology in 6G, aiming to transcend the constraints of "Shannon's trap" by filtering out redundant information and extracting the core of effective data. Compared to traditional communication paradigms, SemCom offers several notable advantages, such as reducing the burden on data transmission, enhancing network management efficiency, and optimizing resource allocation. Numerous researchers have extensively explored SemCom from various perspectives, including network architecture, theoretical analysis, potential technologies, and future applications. However, as SemCom continues to evolve, a multitude of security and privacy concerns have arisen, posing threats to the confidentiality, integrity, and availability of SemCom systems. This paper presents a comprehensive survey of the technologies that can be utilized to secure SemCom. Firstly, we elaborate on the entire life cycle of SemCom, which includes the model training, model transfer, and semantic information transmission phases. Then, we identify the security and privacy issues that emerge during these three stages. Furthermore, we summarize the techniques available to mitigate these security and privacy threats, including data cleaning, robust learning, defensive strategies against backdoor attacks, adversarial training, differential privacy, cryptography, blockchain technology, model compression, and physical-layer security. Lastly, this paper outlines future research directions to guide researchers in related fields.
中文: 语义通信作为6G革命性技术,通过传输核心数据突破香农限制,但面临安全隐私威胁,需采用鲁棒学习、加密等多重防护技术应对。
English: Semantic communication is a revolutionary 6G technology that overcomes Shannon's limitations by transmitting essential data, yet faces security challenges addressed through various protective measures like robust learning and cryptography.

Authors:Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li
Title: Enhancing Transformers for Generalizable First-Order Logical Entailment
Abstract:
Transformers, as the fundamental deep learning architecture, have demonstrated great capability in reasoning. This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and how to improve it. Transformers' capability of first-order reasoning is further captured by whether they can conduct first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish the connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) unseen knowledge and query settings discussed in the task of knowledge graph query answering, which makes it possible to characterize the fine-grained generalizability. Results on our comprehensive dataset showed that transformers outperform previous methods designed particularly for this task and provided detailed empirical evidence about the impact of the input query syntax, token embedding, and transformer architectures on their reasoning capability. Interestingly, our results revealed the mismatch of positional encoding and other design choices of transformer architectures in previous practices. Motivated by this, we propose TEGA, a logic-aware architecture that significantly improves the performance in generalizable first-order logical entailment.
中文: 本文研究了Transformer模型在一阶逻辑推理中的泛化能力,并提出了一种逻辑感知架构TEGA,通过全面实证分析显著提升了其在可泛化逻辑蕴含任务中的性能,超越了此前专门设计的算法。
English: This paper investigates transformers' ability to perform first-order logical reasoning and introduces TEGA, a logic-aware architecture that significantly enhances their performance in generalizable logical entailment, outperforming previous methods through comprehensive empirical analysis.

Authors:Yue Fan, Xiaojian Ma, Rongpeng Su, Jun Guo, Rujie Wu, Xi Chen, Qing Li
Title: Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Abstract:
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 4.9% on Ego4D-VQ3D, 5.8% on OpenEQA, and 11.7% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
中文: 本文提出Embodied VideoAgent,一种基于大语言模型的智能体,通过结合第一人称视频与具身感知输入构建动态3D场景记忆,在多项推理与规划任务中显著优于现有方法。
English: This paper introduces Embodied VideoAgent, an LLM-based agent that integrates egocentric video with embodied sensory inputs to build dynamic 3D scene memory, achieving superior performance in reasoning and planning tasks over existing methods.

Authors:Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, Na Zou
Title: MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation
Abstract:
Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, the existing RAG systems frequently struggle with the quality of retrieval documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2-11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
中文: MAIN-RAG框架通过多智能体协同过滤检索文档,采用自适应阈值机制和群体共识策略,在无需额外训练的情况下将答案准确率提升2-11%,同时有效减少无关文档数量。
English: The proposed MAIN-RAG framework employs multiple LLM agents to collaboratively filter retrieved documents through adaptive thresholding and inter-agent consensus, significantly improving answer accuracy by 2-11% while reducing irrelevant documents compared to traditional RAG systems.
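
The adaptive filtering step described above can be illustrated with a minimal sketch. The abstract only states that the threshold is adjusted from the score distribution; the specific rule used here (mean minus `n` standard deviations) and the names `adaptive_filter`, `scored_docs`, and `n` are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def adaptive_filter(scored_docs, n=0.0):
    """Keep documents scoring at or above an adaptive threshold.

    scored_docs: list of (doc_id, relevance_score) pairs, where the score is
    assumed to come from an LLM judge agent (hypothetical here).
    n: how many standard deviations below the mean the threshold sits
    (an assumed knob; larger n keeps more documents).
    """
    scores = [score for _, score in scored_docs]
    # Threshold adapts to the score distribution of this particular query.
    threshold = mean(scores) - n * pstdev(scores)
    kept = [(doc, score) for doc, score in scored_docs if score >= threshold]
    # Return the survivors ordered from most to least relevant.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept, threshold


docs = [("doc_a", 2.0), ("doc_b", -1.0), ("doc_c", 0.5), ("doc_d", -1.5)]
kept, threshold = adaptive_filter(docs)
print(threshold, [doc for doc, _ in kept])
```

Because the threshold is recomputed per query, a query whose retrieved documents all score poorly still keeps its relatively best candidates, which is one way to preserve recall while discarding noise.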

Authors:Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Seyyedali Hosseinalipour, Christopher G. Brinton
Title: Federated Sketching LoRA: A Flexible Framework for Heterogeneous Collaborative Fine-Tuning of LLMs
Abstract:
Fine-tuning large language models (LLMs) on resource-constrained clients remains a challenging problem. Recent works have fused low-rank adaptation (LoRA) techniques with federated fine-tuning to mitigate challenges associated with client model sizes and data scarcity. Still, the heterogeneity of resources remains a critical bottleneck: while higher-rank modules generally enhance performance, varying client capabilities constrain LoRA's feasible rank range. Existing approaches attempting to resolve this issue either lack analytical justification or impose additional computational overhead, leaving a wide gap for efficient and theoretically-grounded solutions. To address these challenges, we propose federated sketching LoRA (FSLoRA), which leverages a sketching mechanism to enable clients to selectively update submatrices of global LoRA modules maintained by the server. By adjusting the sketching ratios, which determine the ranks of the submatrices on the clients, FSLoRA flexibly adapts to client-specific communication and computational constraints. We provide a rigorous convergence analysis of FSLoRA that characterizes how the sketching ratios affect the convergence rate. Through comprehensive experiments on multiple datasets and LLM models, we demonstrate FSLoRA's performance improvements compared to various baselines.
中文: 在资源受限的客户端微调大语言模型具有挑战性,但FSLoRA通过草图机制使客户端能选择性更新全局LoRA模块的子矩阵,灵活适应不同客户端的通信与计算限制,并提供了严格的收敛性分析和实验性能提升验证。
English: Fine-tuning large language models on resource-limited clients is challenging, but FSLoRA introduces a sketching mechanism that allows clients to selectively update submatrices of global LoRA modules, adapting to their specific constraints while providing rigorous convergence analysis and demonstrating performance gains in experiments.
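
As a rough illustration of clients updating only submatrices of a global LoRA module, the sketch below treats the module as a list of rank-one components and lets each client touch a random subset whose size is set by its sketching ratio. The sampling scheme, the averaging rule in `server_merge`, and the stand-in `fake_local_training` are assumptions for illustration; the paper's actual sketching matrices and aggregation may differ.

```python
import random


def sample_rank_indices(rank, ratio, rng):
    """Pick which rank-one components a client will update."""
    k = max(1, round(ratio * rank))
    return sorted(rng.sample(range(rank), k))


def fake_local_training(B_cols, A_rows, indices):
    """Stand-in for local fine-tuning: adds 1.0 to each selected entry."""
    return {
        i: ([b + 1.0 for b in B_cols[i]], [a + 1.0 for a in A_rows[i]])
        for i in indices
    }


def average(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def server_merge(B_cols, A_rows, client_updates):
    """Average each component over the clients that updated it."""
    new_B, new_A = list(B_cols), list(A_rows)
    for i in range(len(B_cols)):
        updates = [u[i] for u in client_updates if i in u]
        if updates:  # components no client touched stay unchanged
            new_B[i] = average([b for b, _ in updates])
            new_A[i] = average([a for _, a in updates])
    return new_B, new_A


rng = random.Random(0)
rank, d, k = 8, 4, 3
B_cols = [[0.0] * d for _ in range(rank)]  # column i of B
A_rows = [[1.0] * k for _ in range(rank)]  # row i of A
client_updates = [
    fake_local_training(B_cols, A_rows, sample_rank_indices(rank, ratio, rng))
    for ratio in (0.25, 0.5)  # a weak client and a stronger client
]
new_B, new_A = server_merge(B_cols, A_rows, client_updates)
```

A lower ratio means fewer components to train and transmit, which is how this toy version mirrors the adaptation to client-specific constraints described in the abstract.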

Authors:Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han
Title: SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
Abstract:
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.81 on GenEval, which can be further improved to 0.96 through inference scaling with VILA-Judge, establishing a new SoTA on GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible. Our code and pre-trained models are released.
中文: SANA-1.5提出了一种线性扩散Transformer,通过高效训练扩展、模型深度剪枝和推理时扩展三大创新,在不同计算预算下高效实现了文本到图像生成的顶尖性能。
English: SANA-1.5 introduces a linear Diffusion Transformer with three innovations—efficient training scaling, model depth pruning, and inference-time scaling—to achieve state-of-the-art text-to-image generation efficiently across various computational budgets.

Authors:Hasret Taha Akçalı, Özlem Tuğfe Demir, Tolga Girici, Emil Björnson
Title: Predictive Beamforming with Distributed MIMO
Abstract:
In vehicle-to-everything (V2X) applications, roadside units (RSUs) can be tasked with both sensing and communication functions to enable sensing-assisted communications. Recent studies have demonstrated that distance, angle, and velocity information obtained through sensing can be leveraged to reduce the overhead associated with communication beam tracking. In this work, we extend this concept to scenarios involving multiple distributed RSUs and distributed MIMO (multiple-input multiple-output) systems. We derive the state evolution model, formulate the extended Kalman-filter equations, and implement predictive beamforming for distributed MIMO. Simulation results indicate that, when compared with a co-located massive MIMO antenna array, distributed antennas lead to more uniform and robust sensing performance, coverage, and data rates, while the vehicular user is in motion.
中文: 在车联网应用中,分布式路侧单元通过传感辅助通信,采用扩展卡尔曼滤波实现分布式多输入多输出系统的预测性波束成形,相比集中式大规模天线阵列,能在车辆移动时提供更均匀、更稳健的性能。
English: In V2X applications, distributed RSUs with sensing-assisted communications use extended Kalman filtering for predictive beamforming in distributed MIMO systems, achieving more uniform and robust performance than co-located massive MIMO arrays during vehicle motion.

Authors:Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
Title: Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains
Abstract:
Question generation in education is a time-consuming and cognitively demanding task, as it requires creating questions that are both contextually relevant and pedagogically sound. Current automated question generation methods often generate questions that are out of context. In this work, we explore advanced techniques for automated question generation in educational contexts, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model that merges both methods. We implement GPT-4 for ICL using few-shot examples and BART with a retrieval module for RAG. The Hybrid Model combines RAG and ICL to address these issues and improve question quality. Evaluation is conducted using automated metrics, followed by human evaluation metrics. Our results show that both the ICL approach and the Hybrid Model consistently outperform other methods, including baseline models, by generating more contextually accurate and relevant questions.
中文摘要:本研究探索了教育领域自动生成问题的先进技术,包括情境学习、检索增强生成及新型混合模型,结果表明情境学习方法和混合模型能生成更符合语境且相关的问题,显著优于其他方法。
English Summary: This study explores advanced automated question generation techniques in education, including In-Context Learning, Retrieval-Augmented Generation, and a novel Hybrid Model, with results showing that ICL and the Hybrid Model outperform other methods by producing more contextually accurate and relevant questions.

Authors:Zhongyu Jiang, Wenhao Chai, Zhuoran Zhou, Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang
Title: PackDiT: Joint Human Motion and Text Generation via Mutual Prompting
Abstract:
Human motion generation has advanced markedly with the advent of diffusion models. Most recent studies have concentrated on generating motion sequences based on text prompts, commonly referred to as text-to-motion generation. However, the bidirectional generation of motion and text, enabling tasks such as motion-to-text alongside text-to-motion, has been largely unexplored. This capability is essential for aligning diverse modalities and supports unconditional generation. In this paper, we introduce PackDiT, the first diffusion-based generative model capable of performing various tasks simultaneously, including motion generation, motion prediction, text generation, text-to-motion, motion-to-text, and joint motion-text generation. Our core innovation leverages mutual blocks to integrate multiple diffusion transformers (DiTs) across different modalities seamlessly. We train PackDiT on the HumanML3D dataset, achieving state-of-the-art text-to-motion performance with an FID score of 0.106, along with superior results in motion prediction and in-between tasks. Our experiments further demonstrate that diffusion models are effective for motion-to-text generation, achieving performance comparable to that of autoregressive models.
Chinese: PackDiT是首个基于扩散模型的双向运动-文本生成框架,通过融合多种扩散变换器,在文本驱动运动生成和运动转文本等任务中实现了最先进的性能。
English: PackDiT is the first diffusion-based model that enables bidirectional motion-text generation, integrating multiple diffusion transformers to achieve state-of-the-art performance across tasks like text-to-motion and motion-to-text.

Authors:Renshan Zhang, Rui Shao, Gongwei Chen, Miao Zhang, Kaiwen Zhou, Weili Guan, Liqiang Nie
Title: FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
Abstract:
The incorporation of high-resolution visual input equips multimodal large language models (MLLMs) with enhanced visual perception capabilities for real-world tasks. However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. FALCON introduces a novel visual register technique to simultaneously: 1) Eliminate redundant tokens at the stage of visual encoding. To directly address the visual redundancy present in the output of vision encoder, we propose a Register-based Representation Compacting (ReCompact) mechanism. This mechanism introduces a set of learnable visual registers designed to adaptively aggregate essential information while discarding redundancy. It enables the encoder to produce a more compact visual representation with a minimal number of output tokens, thus eliminating the need for an additional compression module. 2) Ensure continuity in visual encoding. To address the potential encoding errors caused by fragmented visual inputs, we develop a Register Interactive Attention (ReAtten) module. This module facilitates effective and efficient information exchange across sub-images by enabling interactions between visual registers. It ensures the continuity of visual semantics throughout the encoding. We conduct comprehensive experiments with FALCON on high-resolution benchmarks across a wide range of scenarios. FALCON demonstrates superior performance with a remarkable 9-fold reduction in visual tokens.
中文: FALCON模型通过创新的视觉寄存器技术,有效消除冗余标记并确保视觉编码的连续性,在显著减少标记数量的同时实现了卓越性能。
English: The FALCON model introduces a novel visual register technique that eliminates redundant tokens and ensures continuity in visual encoding, achieving superior performance with a significant reduction in tokens.

Authors:Haoran Lu, Xun Jiang, Yanbang Chu, Ziqiao Xu, Rui Guo, Wanyue Peng, Yibo Lin, Runsheng Wang, Heng Wu, Ru Huang
Title: A Tale of Two Sides of Wafer: Physical Implementation and Block-Level PPA on Flip FET with Dual-sided Signals
Abstract:
As the conventional scaling of logic devices comes to an end, a functional wafer backside and 3D transistor stacking have become the consensus direction for next-generation logic technology, offering considerable design space extension for power, signals, or even devices on the wafer backside. The Flip FET (FFET), a novel transistor architecture combining 3D transistor stacking and a fully functional wafer backside, was recently proposed. With symmetric dual-sided standard cell design, the FFET can deliver around 12.5% cell area scaling and faster yet more energy-efficient libraries beyond other stacked transistor technologies such as CFET. Besides, thanks to the novel cell design with dual-sided pins, the FFET supports dual-sided signal routing, delivering better routability and larger backside design space. In this work, we demonstrated a comprehensive FFET evaluation framework considering physical implementation and block-level power-performance-area (PPA) assessment for the first time, in which key functions are dual-sided routing and dual-sided RC extraction. A 32-bit RISC-V core was used for the evaluation here. Compared to the CFET with single-sided signals, the FFET with single-sided signals achieved 23.3% post-P&R core area reduction, 25.0% higher frequency and 11.9% lower power at the same utilization, and 16.0% higher frequency at the same core area. Meanwhile, the FFET supports dual-sided signals, which can benefit further from flexible allocation of cell input pins on both sides. By optimizing the input pin density and BEOL routing layer number on each side, 10.6% frequency gain was realized without power degradation compared to the one with single-sided signal routing. Moreover, the routability and power efficiency of FFET barely degrade even with the routing layer number reduced from 12 to 5 on each side, validating the great space for cost-friendly design enabled by FFET.
中文摘要:Flip FET(FFET)架构通过双面晶体管堆叠和布线设计,相比CFET等现有技术实现了芯片面积、运行频率和能效的显著提升,同时具备更优的设计灵活性与成本控制潜力。
English summary: The Flip FET (FFET) architecture enables dual-sided transistor stacking and routing, achieving significant improvements in chip area, frequency, and power efficiency compared to existing technologies like CFET, while offering greater design flexibility and cost-friendly options.

Authors:Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
Title: ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval
Abstract:
Retrieval-Augmented Generation (RAG) models have drawn considerable attention in modern open-domain question answering. The effectiveness of RAG depends on the quality of the top retrieved documents. However, conventional retrieval methods sometimes fail to rank the most relevant documents at the top. In this paper, we introduce ASRank, a new re-ranking method based on scoring retrieved documents using zero-shot answer scent, which relies on a pre-trained large language model to compute the likelihood of the document-derived answers aligning with the answer scent. Our approach demonstrates marked improvements across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA, and Entity Questions. Notably, ASRank increases Top-1 retrieval accuracy on NQ from $19.2\%$ to $46.5\%$ for MSS and $22.1\%$ to $47.3\%$ for BM25. It also shows strong retrieval performance on several datasets compared to state-of-the-art methods (47.3 Top-1 by ASRank vs 35.4 by UPR with BM25).
中文: 本文提出ASRank,一种基于零样本答案线索的新型重排序方法,利用预训练大语言模型评估文档与答案线索的匹配度,在多个数据集上显著提升了检索准确性,优于现有先进方法。
English: The paper introduces ASRank, a novel re-ranking method that uses zero-shot answer scent with a pre-trained large language model to score documents, significantly improving retrieval accuracy across multiple datasets compared to conventional methods.

Authors:Taslim Murad, Prakash Chourasia, Sarwan Ali, Imdad Ullah Khan, Murray Patterson
Title: Neuromorphic Spiking Neural Network Based Classification of COVID-19 Spike Sequences
Abstract:
The amount of available SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) virus data has grown exponentially post-COVID, opening the door to research on the virus's behavior. Researchers have conducted various studies, such as genomic surveillance, to gain a deeper understanding of the virus so that efficient prevention mechanisms can be developed. However, the unstable nature of the virus (rapid mutations, multiple hosts, etc.) creates challenges in designing analytical systems for it. Therefore, we propose a neural network-based (NN) mechanism to perform an efficient analysis of the SARS-CoV-2 data, as NNs exhibit generalized behavior after training. Moreover, rather than using the full-length genome of the virus, we apply our method to its spike region, as this region is known to have predominant mutations and is used to attach to the host cell membrane. In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed-length numerical representation and then uses a Neuromorphic Spiking Neural Network to classify those sequences. We compare the performance of our method with various baselines using real-world SARS-CoV-2 spike sequence data and show that our method is able to achieve higher predictive accuracy compared to the recent baselines.
中文: 本研究提出一种基于神经网络的机制,通过将SARS-CoV-2病毒刺突蛋白序列转化为数值表示并采用神经形态脉冲神经网络进行分类,相比现有基线方法实现了更高的预测准确率。
English: The study proposes a neural network-based approach to efficiently analyze SARS-CoV-2 spike protein data by converting sequences into numerical representations and using a neuromorphic spiking neural network for classification, achieving higher predictive accuracy than existing methods.

Authors:Liang Wang, Haonan Chen, Nan Yang, Xiaolong Huang, Zhicheng Dou, Furu Wei
Title: Chain-of-Retrieval Augmented Generation
Abstract:
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Conventional RAG methods usually perform a single retrieval step before the generation process, which limits their effectiveness in addressing complex queries due to imperfect retrieval results. In contrast, our proposed method, CoRAG (Chain-of-Retrieval Augmented Generation), allows the model to dynamically reformulate the query based on the evolving state. To train CoRAG effectively, we utilize rejection sampling to automatically generate intermediate retrieval chains, thereby augmenting existing RAG datasets that only provide the correct final answer. At test time, we propose various decoding strategies to scale the model's test-time compute by controlling the length and number of sampled retrieval chains. Experimental results across multiple benchmarks validate the efficacy of CoRAG, particularly in multi-hop question answering tasks, where we observe more than 10 points improvement in EM score compared to strong baselines. On the KILT benchmark, CoRAG establishes a new state-of-the-art performance across a diverse range of knowledge-intensive tasks. Furthermore, we offer comprehensive analyses to understand the scaling behavior of CoRAG, laying the groundwork for future research aimed at developing factual and grounded foundation models.
中文: 本文提出CoRAG方法,通过动态查询重构和自动生成检索链实现迭代式检索与推理,显著提升了多跳问答等复杂任务的表现,并在KILT基准测试中创下新纪录。
English: This paper presents CoRAG, a method that enhances RAG models by enabling iterative retrieval and reasoning through dynamic query reformulation and automated chain generation, achieving significant performance gains in complex tasks like multi-hop QA and setting new benchmarks on KILT.

Authors:Xi Xiao, Zhengji Li, Wentao Wang, Jiacheng Xie, Houjie Lin, Swalpa Kumar Roy, Tianyang Wang, Min Xu
Title: TD-RD: A Top-Down Benchmark with Real-Time Framework for Road Damage Detection
Abstract:
Object detection has witnessed remarkable advancements over the past decade, largely driven by breakthroughs in deep learning and the proliferation of large-scale datasets. However, the domain of road damage detection remains relatively under-explored, despite its critical significance for applications such as infrastructure maintenance and road safety. This paper addresses this gap by introducing a novel top-down benchmark that offers a complementary perspective to existing datasets, specifically tailored for road damage detection. Our proposed Top Down Road Damage Detection Dataset (TDRD) includes three primary categories of road damage (cracks, potholes, and patches) captured from a top-down viewpoint. The dataset consists of 7,088 high-resolution images, encompassing 12,882 annotated instances of road damage. Additionally, we present a novel real-time object detection framework, TDYOLOV10, designed to handle the unique challenges posed by the TDRD dataset. Comparative studies with state-of-the-art models demonstrate competitive baseline results. By releasing TDRD, we aim to accelerate research in this crucial area. A sample of the dataset will be made publicly available upon the paper's acceptance.
中文摘要:本文针对道路损伤检测领域提出了一种新颖的俯视视角基准数据集TDRD及实时检测框架TDYOLOV10,通过包含7,088张高分辨率图像和12,882个标注实例填补了该领域的研究空白。
English Summary: This paper introduces a novel top-down benchmark dataset (TDRD) for road damage detection and proposes a real-time detection framework (TDYOLOV10) to address this under-explored domain, achieving competitive baseline results.

Authors:Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, Ziwei Liu
Title: Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Abstract:
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
Chinese: Video-MMMU基准通过三个认知阶段评估大型多模态模型从视频中获取知识的能力,揭示了随着任务复杂度增加,模型表现与人类存在显著差距。
English: The Video-MMMU benchmark evaluates Large Multimodal Models' knowledge acquisition through videos across three cognitive stages, revealing significant performance gaps compared to humans as task complexity increases.
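
The abstract says only that Δknowledge quantifies the improvement in performance after video viewing. One plausible reading, assumed here, is a normalized gain: the accuracy improvement divided by the headroom that was available before the video. The function name `delta_knowledge` and the example accuracies are illustrative, not values from the paper.

```python
def delta_knowledge(acc_before, acc_after):
    """Normalized knowledge gain in percent (assumed form of the metric).

    acc_before, acc_after: accuracies in percent (0 to 100) on the same
    questions, answered without and with access to the video.
    """
    if not 0.0 <= acc_before < 100.0:
        raise ValueError("acc_before must be in [0, 100)")
    if not 0.0 <= acc_after <= 100.0:
        raise ValueError("acc_after must be in [0, 100]")
    # Divide by the remaining headroom so that models with a high starting
    # accuracy are not penalized for having little room to improve.
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


# Made-up example: accuracy rises from 40% to 55% after watching the video.
gain = delta_knowledge(40.0, 55.0)
print(gain)
```

Under this form, a negative value means the model answered worse after watching the video than before.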

Authors:Chenjia Bai, Yang Zhang, Shuang Qiu, Qiaosheng Zhang, Kang Xu, Xuelong Li
Title: Online Preference Alignment for Language Models via Count-based Exploration
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage, and the resulting reward model generalizes poorly to out-of-distribution responses. Thus, online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs. In this paper, we study the fundamental problem in online RLHF, i.e. \emph{how to explore} for LLMs. We give a theoretical motivation under a linear reward assumption to show that an optimistic reward with an upper confidence bound (UCB) term leads to a provably efficient RLHF policy. Then, we reformulate our objective to direct preference optimization with an exploration term, where the UCB-term can be converted to a count-based exploration bonus. We further propose a practical algorithm, named \emph{Count-based Online Preference Optimization (COPO)}, which leverages a simple coin-flip counting module to estimate the pseudo-count of a prompt-response pair in previously collected data. COPO encourages LLMs to balance exploration and preference optimization in an iterative manner, which enlarges the exploration space and the entire data coverage of iterative LLM policies. We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance.
中文: 在线RLHF通过迭代探索克服了固定数据集的局限性,而提出的COPO算法有效平衡了探索与偏好优化,从而显著提升了大型语言模型的性能。
English: Online RLHF addresses the limitations of fixed datasets by enabling iterative exploration, and the proposed COPO algorithm efficiently balances exploration with preference optimization to enhance LLM performance.
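
The coin-flip counting idea can be illustrated in a tabular setting. Each visit to a prompt-response pair draws a vector of random ±1 coins; the squared average of those coins has expectation 1/n after n visits, so it serves as an inverse pseudo-count. The class below is a toy sketch under that assumption: the name `CoinFlipCounter` is hypothetical, and a practical system would replace the table with a learned regressor over model features.

```python
import random


class CoinFlipCounter:
    """Tabular sketch of coin-flip pseudo-counting (illustrative only)."""

    def __init__(self, num_coins=2000, seed=0):
        self.num_coins = num_coins
        self.rng = random.Random(seed)
        self.coin_sums = {}  # key -> per-coin running sums of +/-1 draws
        self.visits = {}     # true counts, used only to form the running mean

    def update(self, key):
        sums = self.coin_sums.setdefault(key, [0] * self.num_coins)
        for j in range(self.num_coins):
            sums[j] += self.rng.choice((-1, 1))
        self.visits[key] = self.visits.get(key, 0) + 1

    def inverse_count(self, key):
        """Estimate 1/n; unseen keys get the maximum value 1.0."""
        if key not in self.coin_sums:
            return 1.0
        n = self.visits[key]
        # E[(mean of n fair +/-1 coins)^2] = 1/n, averaged over all coins.
        return sum((s / n) ** 2 for s in self.coin_sums[key]) / self.num_coins

    def bonus(self, key):
        """Count-based exploration bonus proportional to 1/sqrt(n)."""
        return self.inverse_count(key) ** 0.5


counter = CoinFlipCounter()
pair = ("prompt text", "response text")
for _ in range(4):
    counter.update(pair)
estimate = counter.inverse_count(pair)  # close to 1/4 in expectation
```

Rarely seen pairs receive a large bonus and frequently seen pairs a small one, which is the balance between exploration and preference optimization that the abstract describes.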

Authors:Viswanathan Ramachandran, Tobias J. Oechtering, Mikael Skoglund
Title: Multi-terminal Strong Coordination over Noisy Channels with Encoder Co-operation
Abstract:
We investigate the problem of strong coordination over a multiple-access channel (MAC) with cribbing encoders. In this configuration, two encoders each observe independent and identically distributed (i.i.d.) samples of a source random variable and encode the inputs to the MAC. The decoder, which observes the output of the MAC together with side information, must generate approximately i.i.d. samples of another random variable which is jointly distributed with the two sources and the side information. We also allow for possible encoder cooperation, where one of the encoders can non-causally crib from the other encoder's input. Independent pairwise shared randomness is assumed between each encoder and the decoder at limited rates. Firstly, in the presence of cribbing, we derive an achievable region based on joint source-channel coding. We also prove that in the absence of cribbing, our inner bound is tight for the special case when the MAC is composed of deterministic links, and the sources are conditionally independent given the side information. We then explicitly compute the regions for an example both with and without cribbing between the encoders, and demonstrate that cribbing strictly improves upon the achievable region.
Chinese: 本研究探讨了带有窥视编码器的多址接入信道中的强协调问题,推导了可达区域并证明编码器间的窥视能严格提升协调性能。
English: This study explores strong coordination in a multiple-access channel with cribbing encoders, deriving an achievable region and demonstrating that cribbing strictly enhances coordination performance.

Authors:Jinyu Wang, Jingjing Fu, Rui Wang, Lei Song, Jiang Bian
Title: PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation
Abstract:
Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. The reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning from specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems' problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from the data chunks and iteratively construct the rationale based on the original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks.
中文: PIKE-RAG系统通过引入专业知识提取与逻辑链构建,结合任务分级策略和知识感知的分解方法,有效提升了工业场景中RAG系统的推理能力与适应性。
English: The PIKE-RAG system addresses limitations in current RAG systems by incorporating specialized knowledge extraction and rationale construction, demonstrating superior performance through strategic task classification and knowledge-aware methodologies.

Authors:Leonhard Grosse, Sara Saeidian, Tobias J. Oechtering, Mikael Skoglund
Title: Bounds on the privacy amplification of arbitrary channels via the contraction of $f_α$-divergence
Abstract:
We examine the privacy amplification of channels that do not necessarily satisfy any LDP guarantee by analyzing their contraction behavior in terms of $f_α$-divergence, an $f$-divergence related to Rényi-divergence via a monotonic transformation. We present bounds on contraction for restricted sets of prior distributions via $f$-divergence inequalities and present an improved Pinsker's inequality for $f_α$-divergence based on the joint range technique by Harremoës and Vajda. The presented bound is tight whenever the value of the total variation distance is larger than $1/α$. By applying these inequalities in a cross-channel setting, we arrive at strong data processing inequalities for $f_α$-divergence that can be adapted to use-case specific restrictions of input distributions and channel. The application of these results to privacy amplification shows that even very sparse channels can lead to significant privacy amplification when used as a post-processing step after local differentially private mechanisms.
中文: 本研究通过$f_α$-散度收缩分析无LDP保证信道的隐私放大效应,建立了紧致边界和改进不等式,证明即使在稀疏信道作为本地差分隐私机制后处理步骤时也能实现显著隐私增强。
English: This study analyzes the privacy amplification of channels lacking LDP guarantees through $f_α$-divergence contraction, developing tight bounds and improved inequalities that demonstrate significant privacy enhancement even with sparse channels when used after LDP mechanisms.
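The contraction idea in this abstract can be illustrated numerically. The sketch below is not the paper's $f_α$-divergence bound (the abstract does not define $f_α$); it assumes the simpler, classical case of total variation distance, where a channel $K$ satisfies $TV(PK, QK) \le \eta_{TV}(K) \cdot TV(P, Q)$ and $\eta_{TV}(K)$ is the Dobrushin coefficient, the largest total variation distance between any two rows of $K$. The channel and priors are made-up values for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def push_through(prior, channel):
    """Output distribution of a channel; channel[x][y] = P(Y = y | X = x)."""
    n_out = len(channel[0])
    return [
        sum(prior[x] * channel[x][y] for x in range(len(prior)))
        for y in range(n_out)
    ]


def dobrushin_coefficient(channel):
    """Largest total variation distance between any two rows of the channel."""
    return max(total_variation(r1, r2) for r1 in channel for r2 in channel)


# Hypothetical sparse channel: each input reaches only two of three outputs.
channel = [
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
]
p = [0.8, 0.2]
q = [0.3, 0.7]

tv_in = total_variation(p, q)
tv_out = total_variation(push_through(p, channel), push_through(q, channel))
eta = dobrushin_coefficient(channel)

# Data processing with contraction: the output distance shrinks by at most eta.
print(tv_in, tv_out, eta)
```

Even though this channel has zero entries, its coefficient is below one, so the two output distributions are closer than the two priors. The same qualitative effect is what the abstract reports for sparse channels.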

Authors:Jiwan Chung, Seungwon Lim, Sangkyu Lee, Youngjae Yu
Title: MASS: Overcoming Language Bias in Image-Text Matching
Abstract:
Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.
中文: 多模态关联评分(MASS)框架通过增强视觉准确性来减少视觉语言模型中的语言偏见,无需额外训练即可无缝集成,同时保持对语言组合性的理解。
English: The Multimodal Association Score (MASS) framework reduces language bias in visual-language models by enhancing visual accuracy for image-text matching, seamlessly integrating without extra training while preserving linguistic understanding.

Authors:Benjamin Reidys, Pantea Zardoshti, Íñigo Goiri, Celine Irvene, Daniel S. Berger, Haoran Ma, Kapil Arya, Eli Cortez, Taylor Stark, Eugene Bak, Mehmet Iyigun, Stanko Novaković, Lisa Hsu, Karel Trueba, Abhisek Pan, Chetan Bansal, Saravan Rajmohan, Jian Huang, Ricardo Bianchini
Title: Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms
Abstract:
Cloud platforms remain underutilized despite multiple proposals to improve their utilization (e.g., disaggregation, harvesting, and oversubscription). Our characterization of the resource utilization of virtual machines (VMs) in Azure reveals that, while CPU is the main underutilized resource, we need to provide a solution to manage all resources holistically. We also observe that many VMs exhibit complementary temporal patterns, which can be leveraged to improve the oversubscription of underutilized resources. Based on these insights, we propose Coach: a system that exploits temporal patterns for all-resource oversubscription in cloud platforms. Coach uses long-term predictions and an efficient VM scheduling policy to exploit temporally complementary patterns. We introduce a new general-purpose VM type, called CoachVM, where we partition each resource allocation into a guaranteed and an oversubscribed portion. Coach monitors the oversubscribed resources to detect contention and mitigate any potential performance degradation. We focus on memory management, which is particularly challenging due to memory's sensitivity to contention and the overhead required to reassign it between CoachVMs. Our experiments show that Coach enables platforms to host up to ~26% more VMs with minimal performance degradation.
中文摘要:尽管已有多种策略旨在提升云平台利用率,对Azure虚拟机的研究显示CPU是主要未充分利用的资源,但需整体管理所有资源,因此提出Coach系统,利用时间互补模式实现全资源超售,可在性能影响最小的情况下增加约26%的虚拟机承载量。
English Summary: Despite existing strategies to enhance cloud platform utilization, a study on Azure VMs reveals that CPU is the most underutilized resource, yet a holistic approach is needed, leading to the development of Coach, a system that leverages complementary temporal patterns for all-resource oversubscription, increasing VM capacity by up to 26% with minimal performance impact.
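A minimal sketch of the two ideas named in the abstract: splitting an allocation into a guaranteed and an oversubscribed portion, and co-locating VMs whose predicted demand peaks at different times. The function names, the per-slot demand profiles, and the simple "sum of demands per time slot" feasibility check are illustrative assumptions, not Coach's actual scheduling policy.

```python
def split_allocation(requested, guaranteed_fraction):
    """Split a resource request into (guaranteed, oversubscribed) portions.

    guaranteed_fraction is a hypothetical knob in [0, 1].
    """
    guaranteed = requested * guaranteed_fraction
    return guaranteed, requested - guaranteed


def fits(server_capacity, vm_profiles):
    """True if the summed predicted demand stays within capacity in every slot.

    vm_profiles is a list of equal-length lists, one predicted demand per slot.
    """
    slots = len(vm_profiles[0])
    return all(
        sum(vm[t] for vm in vm_profiles) <= server_capacity
        for t in range(slots)
    )


# Hypothetical profiles over four time slots (e.g. GB of memory).
day_vm = [8, 8, 2, 2]
night_vm = [2, 2, 8, 8]

# The peaks sum to 16, above capacity 10, yet the patterns are complementary.
print(fits(10, [day_vm, night_vm]))  # complementary peaks
print(fits(10, [day_vm, day_vm]))    # coinciding peaks
```

Sizing by peak demand alone would reject the first pair. Checking demand per time slot is what lets complementary VMs share the server.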

Authors:Giyeong Oh, Saejin Kim, Woohyun Cho, Sangkyu Lee, Jiwan Chung, Dokyung Song, Youngjae Yu
Title: SEAL: Entangled White-box Watermarks on Low-Rank Adaptation
Abstract:
Recently, LoRA and its variants have become the de facto strategy for training and sharing task-specific versions of large pretrained models, thanks to their efficiency and simplicity. However, the issue of copyright protection for LoRA weights, especially through watermark-based techniques, remains underexplored. To address this gap, we propose SEAL (SEcure wAtermarking on LoRA weights), a universal whitebox watermarking method for LoRA. SEAL embeds a secret, non-trainable matrix between trainable LoRA weights, serving as a passport to claim ownership. SEAL then entangles the passport with the LoRA weights through training, without extra loss for entanglement, and distributes the finetuned weights after hiding the passport. When applying SEAL, we observed no performance degradation across commonsense reasoning, textual/visual instruction tuning, and text-to-image synthesis tasks. We demonstrate that SEAL is robust against a variety of known attacks: removal, obfuscation, and ambiguity attacks.
中文:SEAL是一种针对LoRA权重的通用白盒水印方法,通过在训练中嵌入不可训练的护照矩阵,确保在各种任务和攻击下实现稳健的版权保护且不损失性能。
English: SEAL is a universal whitebox watermarking method for LoRA weights that embeds a non-trainable passport matrix during training, ensuring robust copyright protection without performance loss across various tasks and attacks.
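The structure described in the abstract, a fixed matrix placed between the two trainable LoRA factors, can be sketched as a weight update of the form $\Delta W = B C A$. The shapes, the values, and the step that hides the passport by folding $C$ into $B$ are assumptions made for illustration; the abstract does not specify how SEAL hides the passport.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
        for row in x
    ]


# Trainable LoRA factors (hypothetical values): B is d_out x r, A is r x d_in.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

# Secret, non-trainable passport C (r x r), kept fixed during training.
C = [[0.0, 1.0], [1.0, 0.0]]

# Weight update with the passport between the trainable factors.
delta_w = matmul(matmul(B, C), A)

# Assumed hiding step: fold C into B so the released pair looks like
# ordinary LoRA weights while producing the same update.
B_released = matmul(B, C)
assert matmul(B_released, A) == delta_w
print(delta_w)
```

Because the released pair reproduces the same update, a third party sees standard LoRA weights, while the owner can still refer to the original factorization through $C$.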

Authors:Yuxuan Hu, Jing Zhang, Xiaodong Chen, Zhe Zhao, Cuiping Li, Hong Chen
Title: LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model
Abstract:
Existing low-rank adaptation (LoRA) methods face challenges on sparse large language models (LLMs) due to the inability to maintain sparsity. Recent works introduced methods that maintain sparsity by augmenting LoRA techniques with additional masking mechanisms. Despite these successes, such approaches suffer from increased memory and computation overhead, which reduces the efficiency of LoRA methods. In response to this limitation, we introduce LoRS, an innovative method designed to achieve both memory and computation efficiency when fine-tuning sparse LLMs. To mitigate the substantial memory and computation demands associated with preserving sparsity, our approach incorporates strategies of weight recompute and computational graph rearrangement. In addition, we improve the effectiveness of LoRS through better adapter initialization. These innovations lead to a notable reduction in memory and computation consumption during the fine-tuning phase, all while achieving performance levels that outperform existing LoRA approaches.
Chinese: LoRS是一种高效方法,在稀疏大语言模型微调中通过权重重计算和计算图重组策略,显著降低了内存和计算消耗,同时保持了稀疏性并超越了现有LoRA方法的性能。
English: LoRS is an efficient method that reduces memory and computation overhead while maintaining sparsity during fine-tuning of sparse large language models, outperforming existing LoRA approaches.
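The masking idea that the abstract attributes to prior sparsity-preserving LoRA methods can be sketched as follows: a dense low-rank update $BA$ would fill in pruned weights, so the merged weight is multiplied elementwise by the sparsity mask. This is a sketch of that baseline, not of the weight recompute or graph rearrangement in LoRS; deriving the mask from the nonzero pattern of the pruned weight is an assumption.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
        for row in x
    ]


def masked_merge(w_sparse, b, a):
    """Merge a LoRA update into a pruned weight while keeping pruned entries zero."""
    delta = matmul(b, a)
    return [
        [(w + d) if w != 0.0 else 0.0 for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_sparse, delta)
    ]


# Hypothetical 2 x 2 pruned weight with rank-1 LoRA factors.
W = [[1.0, 0.0], [0.0, 2.0]]
B = [[1.0], [1.0]]
A = [[0.5, 0.5]]

print(masked_merge(W, B, A))
```

Without the mask, the merged matrix would be fully dense. With it, the zero pattern of `W` survives the merge.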

Authors:Huy Q. Le, Ye Lin Tun, Yu Qiao, Minh N. H. Nguyen, Keon Oh Kim, Choong Seon Hong
Title: Mitigating Domain Shift in Federated Learning via Intra- and Inter-Domain Prototypes
Abstract:
Federated Learning (FL) has emerged as a decentralized machine learning technique, allowing clients to train a global model collaboratively without sharing private data. However, most FL studies ignore the crucial challenge of heterogeneous domains where each client has a distinct feature distribution, which is common in real-world scenarios. Prototype learning, which leverages the mean feature vectors within the same classes, has become a prominent solution for federated learning under domain shift. However, existing federated prototype learning methods focus solely on inter-domain prototypes and neglect intra-domain perspectives. In this work, we introduce a novel federated prototype learning method, namely I$^2$PFL, which incorporates $\textbf{I}$ntra-domain and $\textbf{I}$nter-domain $\textbf{P}$rototypes, to mitigate domain shift from both perspectives and learn a generalized global model across multiple domains in federated learning. To construct intra-domain prototypes, we propose feature alignment with MixUp-based augmented prototypes to capture the diversity within local domains and enhance the generalization of local features. Additionally, we introduce a reweighting mechanism for inter-domain prototypes to generate generalized prototypes that reduce domain shift while providing inter-domain knowledge across multiple clients. Extensive experiments on the Digits, Office-10, and PACS datasets illustrate the superior performance of our method compared to other baselines.
中文: 提出的I²PFL方法通过结合域内和域间原型,采用基于MixUp增强的特征对齐和重加权机制,有效缓解联邦学习中的领域偏移问题,从而在多个领域上学习到泛化能力更强的全局模型。
English: The proposed I²PFL method addresses domain shift in federated learning by incorporating both intra-domain and inter-domain prototypes, utilizing feature alignment with MixUp augmentation and a reweighting mechanism to enhance model generalization across multiple domains.
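The two building blocks named in the abstract can be sketched in a few lines: a class prototype is the mean feature vector of a class, and MixUp forms a convex combination $\lambda a + (1 - \lambda) b$ of two vectors. How I$^2$PFL combines them (which vectors are mixed, how $\lambda$ is drawn, how prototypes are reweighted across clients) is not stated in the abstract, so the usage below is purely illustrative.

```python
def class_prototypes(features, labels):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def mixup(a, b, lam):
    """Convex combination of two feature vectors with mixing weight lam."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


# Hypothetical local features on one client.
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = [0, 0, 1]

protos = class_prototypes(features, labels)
augmented = mixup(features[0], features[1], 0.5)
print(protos, augmented)
```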

Authors:Huy Q. Le, Ye Lin Tun, Yu Qiao, Minh N. H. Nguyen, Keon Oh Kim, Eui-Nam Huh, Choong Seon Hong
Title: Mitigating Domain Shift in Federated Learning via Intra- and Inter-Domain Prototypes
Abstract:
Federated Learning (FL) has emerged as a decentralized machine learning technique, allowing clients to train a global model collaboratively without sharing private data. However, most FL studies ignore the crucial challenge of heterogeneous domains where each client has a distinct feature distribution, which is common in real-world scenarios. Prototype learning, which leverages the mean feature vectors within the same classes, has become a prominent solution for federated learning under domain shift. However, existing federated prototype learning methods focus solely on inter-domain prototypes and neglect intra-domain perspectives. In this work, we introduce a novel federated prototype learning method, namely I$^2$PFL, which incorporates $\textbf{I}$ntra-domain and $\textbf{I}$nter-domain $\textbf{P}$rototypes, to mitigate domain shift from both perspectives and learn a generalized global model across multiple domains in federated learning. To construct intra-domain prototypes, we propose feature alignment with MixUp-based augmented prototypes to capture the diversity within local domains and enhance the generalization of local features. Additionally, we introduce a reweighting mechanism for inter-domain prototypes to generate generalized prototypes that reduce domain shift while providing inter-domain knowledge across multiple clients. Extensive experiments on the Digits, Office-10, and PACS datasets illustrate the superior performance of our method compared to other baselines.
中文: 提出的I²PFL方法通过结合域内和域间原型,采用基于MixUp增强的特征对齐和重加权机制,有效缓解联邦学习中的领域偏移问题,从而在多个领域上学习到泛化能力更强的全局模型。
English: The proposed I²PFL method addresses domain shift in federated learning by incorporating both intra-domain and inter-domain prototypes, utilizing feature alignment with MixUp augmentation and a reweighting mechanism to enhance model generalization across multiple domains.

Authors:Oleg Kobzarev, Artem Lykov, Dzmitry Tsetserukou
Title: GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction
Abstract:
This paper introduces GestLLM, an advanced system for human-robot interaction that enables intuitive robot control through hand gestures. Unlike conventional systems, which rely on a limited set of predefined gestures, GestLLM leverages large language models and feature extraction via MediaPipe to interpret a diverse range of gestures. This integration addresses key limitations in existing systems, such as restricted gesture flexibility and the inability to recognize complex or unconventional gestures commonly used in human communication. By combining state-of-the-art feature extraction and language model capabilities, GestLLM achieves performance comparable to leading vision-language models while supporting gestures underrepresented in traditional datasets. For example, this includes gestures from popular culture, such as the "Vulcan salute" from Star Trek, without any additional pretraining, prompt engineering, etc. This flexibility enhances the naturalness and inclusivity of robot control, making interactions more intuitive and user-friendly. GestLLM provides a significant step forward in gesture-based interaction, enabling robots to understand and respond to a wide variety of hand gestures effectively. This paper outlines its design, implementation, and evaluation, demonstrating its potential applications in advanced human-robot collaboration, assistive robotics, and interactive entertainment.
中文: GestLLM是一种创新的人机交互系统,它利用大语言模型和MediaPipe特征提取技术来解读多样化手势,通过无需额外训练即可识别复杂及文化特定手势,突破了传统系统的局限性。
English: GestLLM is an innovative human-robot interaction system that uses large language models and MediaPipe feature extraction to interpret diverse hand gestures, overcoming limitations of traditional systems by recognizing complex and culturally specific gestures without additional training.

Authors:Issatay Tokmurziyev, Miguel Altamirano Cabrera, Luis Moreno, Muhammad Haris Khan, Dzmitry Tsetserukou
Title: GazeGrasp: DNN-Driven Robotic Grasping with Wearable Eye-Gaze Interface
Abstract:
We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robot UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object selection with a magnetic snapping effect and robot control via eye gestures. Experimental evaluation involving 13 participants demonstrated that the magnetic snapping effect significantly reduced gaze alignment time, improving task efficiency by 31%. GazeGrasp provides a robust, hands-free interface for assistive robotics, enhancing accessibility and autonomy for users.
中文: GazeGrasp是一种基于眼动追踪的操控系统,通过磁吸效应和手势控制帮助运动障碍用户操作协作机器人,实验证明其将任务效率提升了31%。
English: GazeGrasp is a gaze-controlled system that enables individuals with motor impairments to operate collaborative robots through eye tracking, significantly improving task efficiency by 31% with its magnetic snapping feature.
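The "magnetic snapping effect" mentioned in the abstract can be pictured as pulling the gaze cursor onto the nearest detected object once it comes close enough. The sketch below is one plausible reading, assuming pixel coordinates, object centers from a detector, and a fixed snap radius; the function name, radius, and coordinates are hypothetical and not taken from the paper.

```python
import math


def snap_gaze(gaze, object_centers, snap_radius):
    """Snap a gaze point to the nearest object center within snap_radius.

    Returns (object_name, snapped_point); object_name is None if nothing
    is close enough, in which case the gaze point is returned unchanged.
    """
    best_name = None
    best_dist = snap_radius
    for name, (cx, cy) in object_centers.items():
        dist = math.hypot(gaze[0] - cx, gaze[1] - cy)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    if best_name is None:
        return None, gaze
    return best_name, object_centers[best_name]


# Hypothetical detections (e.g. bounding-box centers in image pixels).
objects = {"cup": (100, 100), "ball": (300, 120)}

print(snap_gaze((110, 95), objects, 40))   # close to the cup
print(snap_gaze((200, 400), objects, 40))  # far from every object
```

With snapping, the user only has to bring the gaze near a target rather than hold it exactly on the target, which is consistent with the reduced alignment time reported in the abstract.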

Authors:Prashant Kumar, Weiwei Wan, Kensuke Harada
Title: Temperature Driven Multi-modal/Single-actuated Soft Finger
Abstract:
Soft pneumatic fingers are of great research interest. However, their significant potential is limited as most of them can generate only one motion, mostly bending. The conventional design of soft fingers does not allow them to switch to another motion mode. In this paper, we developed a novel multi-modal and single-actuated soft finger where its motion mode is switched by changing the finger's temperature. Our soft finger is capable of switching between three distinctive motion modes (bending, twisting, and extension) in approximately five seconds. We carried out a detailed experimental study of the soft finger and evaluated its repeatability and range of motion. It exhibited repeatability of around one millimeter and a fifty percent larger range of motion than a standard bending actuator. We developed an analytical model for a fiber-reinforced soft actuator for twisting motion. This helped us relate the input pressure to the output twist radius of the twisting motion. This model was validated by experimental verification. Further, a soft robotic gripper with multiple grasp modes was developed using three actuators. This gripper can adapt to and grasp objects of a large range of size, shape, and stiffness. We showcased its grasping capabilities by successfully grasping a small berry, a large roll, and a delicate tofu cube.
中文: 本文开发了一种新型单驱动软体手指,通过温度变化实现在弯曲、扭转和伸展三种运动模式间的切换,具有快速响应和更大运动范围的特点,并应用于多模式抓取的软体机器人夹爪中。
English: This paper introduces a novel single-actuated soft finger that switches between bending, twisting, and extension motions by temperature change, achieving rapid mode transition and enhanced motion range, with applications demonstrated in a multi-grasp robotic gripper.

Authors:Muhammad Haris Khan, Selamawit Asfaw, Dmitrii Iarchuk, Miguel Altamirano Cabrera, Luis Moreno, Issatay Tokmurziyev, Dzmitry Tsetserukou
Title: Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing
Abstract:
This paper introduces Shake-VLA, a Vision-Language-Action (VLA) model-based system designed to enable bimanual robotic manipulation for automated cocktail preparation. The system integrates a vision module for detecting ingredient bottles and reading labels, a speech-to-text module for interpreting user commands, and a language model to generate task-specific robotic instructions. Force Torque (FT) sensors are employed to precisely measure the quantity of liquid poured, ensuring accuracy in ingredient proportions during the mixing process. The system architecture includes a Retrieval-Augmented Generation (RAG) module for accessing and adapting recipes, an anomaly detection mechanism to address ingredient availability issues, and bimanual robotic arms for dexterous manipulation. Experimental evaluations demonstrated high success rates across system components: the speech-to-text module achieved a 93% success rate in noisy environments, the vision module attained a 91% success rate in object and label detection in cluttered environments, the anomaly module successfully identified 95% of discrepancies between detected ingredients and recipe requirements, and the system achieved an overall success rate of 100% in preparing cocktails, from recipe formulation to action generation.
中文: Shake-VLA系统通过整合视觉、语音识别和语言模块,使双手机器人能自主调制鸡尾酒,成功检测原料、解析指令并生成动作,各组件均表现优异,整体调制成功率高达100%。
English: The Shake-VLA system enables bimanual robots to prepare cocktails autonomously by integrating vision, speech recognition, and language models to detect ingredients, interpret commands, and generate actions, achieving high success rates across all components including 100% overall cocktail preparation accuracy.
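The abstract states that force-torque sensors measure the quantity of liquid poured. One way such a measurement can work, shown here as a sketch under assumptions rather than the paper's method, is to convert the drop in vertical force on the bottle-holding arm into mass via $m = \Delta F / g$ and then into volume via the liquid's density. The function name, the density value, and the idea of reading a single force difference are illustrative.

```python
STANDARD_GRAVITY = 9.80665  # m/s^2


def poured_volume_ml(force_drop_newton, density_g_per_ml):
    """Estimate poured volume from the decrease in measured vertical force.

    force_drop_newton: force before pouring minus force after pouring.
    density_g_per_ml: assumed liquid density (about 1.0 for water).
    """
    mass_grams = force_drop_newton / STANDARD_GRAVITY * 1000.0
    return mass_grams / density_g_per_ml


# A force drop of 0.980665 N corresponds to 100 g, i.e. about 100 ml of water.
print(poured_volume_ml(0.980665, 1.0))
```

A real system would also need to filter sensor noise and compensate for arm motion during pouring, which this sketch ignores.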

Authors:Ashitha Mudraje, Brian B. Moser, Stanislav Frolov, Andreas Dengel
Title: Multi-Label Scene Classification in Remote Sensing Benefits from Image Super-Resolution
Abstract:
Satellite imagery is a cornerstone for numerous Remote Sensing (RS) applications; however, limited spatial resolution frequently hinders the precision of such systems, especially in multi-label scene classification tasks, which require a higher level of detail and feature differentiation. In this study, we explore the efficacy of image Super-Resolution (SR) as a pre-processing step to enhance the quality of satellite images and thus improve downstream classification performance. We investigate four SR models - SRResNet, HAT, SeeSR, and RealESRGAN - and evaluate their impact on multi-label scene classification across various CNN architectures, including ResNet-50, ResNet-101, ResNet-152, and Inception-v4. Our results show that applying SR significantly improves downstream classification performance across various metrics, demonstrating its ability to preserve spatial details critical for multi-label tasks. Overall, this work offers valuable insights into the selection of SR techniques for multi-label prediction in remote sensing and presents an easy-to-integrate framework to improve existing RS systems.
中文摘要:本研究证明,将图像超分辨率作为预处理步骤可显著提升卫星图像质量,并在多种神经网络架构中有效改善多标签场景分类性能。
English Summary: This study demonstrates that applying image super-resolution as a preprocessing step significantly enhances satellite image quality and improves multi-label scene classification performance across various neural network architectures.

Authors:Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan
Title: AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Abstract:
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
中文: 本文提出AIOPSLAB框架,旨在通过AI代理(称为AgentOps)实现端到端自动化,推动AIOps发展,使云系统能够自主管理运维任务并自我修复,同时评估揭示了先进大语言模型代理在复杂云环境中处理运维任务的能力与局限。
English: This paper introduces AIOPSLAB, a comprehensive framework designed to advance AIOps by enabling end-to-end automation through AI agents, termed AgentOps, which autonomously manage operational tasks and self-healing cloud systems, with evaluations revealing the capabilities and limitations of state-of-the-art LLM agents in complex cloud environments.

Authors:Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov, Muhammad Ahsan Mustafa, Grik Tadevosyan, Aibek Akhmetkazy, Miguel Altamirano Cabrera, Mikhail Martynov, Sausar Karaf, Dzmitry Tsetserukou
Title: UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
Abstract:
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with the Visual Language Model (VLM) and the powerful capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. This system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by VLM and natural language processing by GPT can provide the user with the path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectory and a mean error of 34.22 m in locating the objects of interest on a map, measured by Euclidean distance using the K-Nearest Neighbors (KNN) approach.
中文:UAV-VLA系统结合卫星图像与视觉语言模型,通过文本请求生成飞行路径和行动计划,利用KNN方法在轨迹长度上差异为22%,目标检测平均误差为34.22米,提升了空中任务效率。
English: The UAV-VLA system integrates satellite imagery with visual and language models to enable users to generate flight paths and action plans through text requests, improving aerial mission efficiency with a 22% trajectory length difference and a 34.22 m mean error in object detection using KNN.
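The two reported metrics can be illustrated with a short sketch: trajectory length as the sum of distances between consecutive waypoints, and localization error as the mean Euclidean distance from each predicted object position to its nearest ground-truth position. Reading "KNN approach" as nearest-neighbor matching with $k = 1$ is an assumption, and all coordinates below are made up.

```python
import math


def path_length(waypoints):
    """Total length of a polyline through the given 2D waypoints."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))


def relative_difference_percent(length, reference_length):
    """Length difference relative to a reference trajectory, in percent."""
    return abs(length - reference_length) / reference_length * 100.0


def mean_nearest_neighbor_error(predicted, ground_truth):
    """Mean distance from each predicted point to its closest ground-truth point."""
    return sum(
        min(math.dist(p, g) for g in ground_truth) for p in predicted
    ) / len(predicted)


# Hypothetical map coordinates in meters.
generated = [(0, 0), (3, 4), (3, 10)]
print(path_length(generated))
print(relative_difference_percent(11.0, 10.0))
print(mean_nearest_neighbor_error([(0, 0), (10, 0)], [(0, 3), (10, 4)]))
```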

Authors:Zhiyuan Wang, Shengcai Liu, Peng Yang, Ke Tang
Title: Evolving Generalizable Parallel Algorithm Portfolios via Domain-Agnostic Instance Generation
Abstract:
Generalization is the core objective when training optimizers from data. However, limited training instances often constrain the generalization capability of the trained optimizers. Co-evolutionary approaches address this challenge by simultaneously evolving a parallel algorithm portfolio (PAP) and an instance population to eventually obtain PAPs with good generalization. Yet, when applied to a specific problem class, these approaches have a major limitation. They require practitioners to provide instance generators specially tailored to the problem class, which is often non-trivial to design. This work proposes a general-purpose, off-the-shelf PAP construction approach, named domain-agnostic co-evolution of parameterized search (DACE), for binary optimization problems where decision variables take values of 0 or 1. The key novelty of DACE lies in its neural network-based domain-agnostic instance representation and generation mechanism that eliminates the need for domain-specific instance generators. The strong generality of DACE is validated across three real-world binary optimization problems: the complementary influence maximization problem (CIMP), the compiler arguments optimization problem (CAOP), and the contamination control problem (CCP). Given only a small set of training instances from these problem classes, DACE, without requiring domain knowledge, constructs PAPs with even better generalization performance than existing approaches on all three classes, despite their use of domain-specific instance generators.
中文: 本研究提出DACE,一种领域无关的协同进化方法,通过基于神经网络的实例表示和生成机制为二元优化问题构建并行算法组合,无需领域特定生成器,并在三个实际问题上展现出更优的泛化性能。
English: This work introduces DACE, a domain-agnostic co-evolutionary approach that constructs parallel algorithm portfolios for binary optimization problems using neural network-based instance representation and generation, eliminating the need for domain-specific generators and demonstrating superior generalization across three real-world problems.
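To make the notion of a parallel algorithm portfolio concrete: in the usual formulation, all member solvers run on an instance in parallel and the portfolio's result is the best result among them. The abstract does not spell this out, so the sketch below relies on that standard definition as an assumption, with a made-up cost table (lower is better).

```python
def average_portfolio_cost(cost_table, portfolio):
    """Average over instances of the best (lowest) cost among portfolio members.

    cost_table maps a solver name to its list of costs, one per instance.
    """
    n_instances = len(next(iter(cost_table.values())))
    return sum(
        min(cost_table[solver][i] for solver in portfolio)
        for i in range(n_instances)
    ) / n_instances


# Hypothetical costs of two solver configurations on three instances.
cost_table = {
    "config_a": [1.0, 9.0, 5.0],
    "config_b": [8.0, 2.0, 6.0],
}

print(average_portfolio_cost(cost_table, ["config_a"]))
print(average_portfolio_cost(cost_table, ["config_a", "config_b"]))
```

The portfolio beats either member alone because the two configurations are strong on different instances. Evolving harder or more diverse training instances, as the abstract describes, is meant to push the portfolio toward this kind of complementarity.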

Authors:Tsz Kin Lam, Marco Gaido, Sara Papi, Luisa Bentivogli, Barry Haddow
Title: Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Abstract:
Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech -- the most common form of communication. The most widespread approach to integrating speech into LLMs is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with a speech encoder. This raises questions about the need for a sophisticated speech encoder for DFP and how its performance compares with a standard encoder-decoder (i.e., cross-attention) architecture. We compare DFP and cross-attention under a variety of configurations, such as CTC compression and sequence-level knowledge distillation, on monolingual, bilingual, and multilingual models. To perform a controlled architectural comparison, we train all models from scratch rather than using large pretrained models and use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on MuST-C v1.0 and CoVoST2 datasets. Despite the wide adoption of DFP, our results do not indicate a clear advantage of DFP over cross-attention.
Chinese: 本研究比较了将语音集成到大型语言模型中的密集特征前置与交叉注意力架构,发现在语音转文本任务中,尽管前者应用广泛,但并未表现出明显优势。
English: This study compares dense feature prepending (DFP) and cross-attention architectures for integrating speech into large language models, finding no clear advantage for DFP in speech-to-text tasks despite its widespread use.

Authors:Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Title: Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Abstract:
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
中文摘要:提出的PCM-Net通过逐块跨模态特征混合机制,在零样本图像描述任务中自适应融合文本特征以修正合成图像的语义偏差,在基准数据集上实现了最优性能。
English Summary: The proposed PCM-Net introduces a patch-wise cross-modal feature mix-up mechanism to address semantic misalignment in zero-shot image captioning by adaptively refining synthetic image features with textual concepts, achieving state-of-the-art performance on benchmark datasets.

Authors:Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Jiuguang Wang, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Title: From Pixels to Predicates: Learning Symbolic World Models via Pretrained Vision-Language Models
Abstract:
Our aim is to learn to solve long-horizon decision-making problems in complex robotics domains given low-level skills and a handful of short-horizon demonstrations containing sequences of images. To this end, we focus on learning abstract symbolic world models that facilitate zero-shot generalization to novel goals via planning. A critical component of such models is the set of symbolic predicates that define properties of and relationships between objects. In this work, we leverage pretrained vision language models (VLMs) to propose a large set of visual predicates potentially relevant for decision-making, and to evaluate those predicates directly from camera images. At training time, we pass the proposed predicates and demonstrations into an optimization-based model-learning algorithm to obtain an abstract symbolic world model that is defined in terms of a compact subset of the proposed predicates. At test time, given a novel goal in a novel setting, we use the VLM to construct a symbolic description of the current world state, and then use a search-based planning algorithm to find a sequence of low-level skills that achieves the goal. We demonstrate empirically across experiments in both simulation and the real world that our method can generalize aggressively, applying its learned world model to solve problems with a wide variety of object types, arrangements, numbers of objects, and visual backgrounds, as well as novel goals and much longer horizons than those seen at training time.
中文: 本研究利用视觉语言模型从少量演示中学习抽象符号世界模型,通过规划实现机器人任务在多样化环境中的零样本泛化。
English: This research develops a method to learn abstract symbolic world models from short demonstrations using vision language models, enabling zero-shot generalization to novel robotic tasks through planning in diverse environments.

Authors:Shuyin Xia, Xiaoyu Lian, Binbin Sang, Guoyin Wang, Xinbo Gao
Title: GBFRS: Robust Fuzzy Rough Sets via Granular-ball Computing
Abstract:
Fuzzy rough set theory is effective for processing datasets with complex attributes, supported by a solid mathematical foundation and closely linked to kernel methods in machine learning. Attribute reduction algorithms and classifiers based on fuzzy rough set theory exhibit promising performance in the analysis of high-dimensional multivariate complex data. However, most existing models operate at the finest granularity, rendering them inefficient and sensitive to noise, especially for high-dimensional big data. Thus, enhancing the robustness of fuzzy rough set models is crucial for effective feature selection. Multi-granularity granular-ball computing, a recent development, uses granular-balls of different sizes to adaptively represent and cover the sample space, performing learning based on these granular-balls. This paper proposes integrating multi-granularity granular-ball computing into fuzzy rough set theory, using granular-balls to replace sample points. The coarse-grained characteristics of granular-balls make the model more robust. Additionally, we propose a new method for generating granular-balls, scalable to the entire supervised method based on granular-ball computing. A forward search algorithm is used to select feature sequences by defining the correlation between features and categories through dependence functions. Experiments demonstrate the proposed model's effectiveness and superiority over baseline methods.
中文摘要:本文通过将多粒度粒球计算融入模糊粗糙集理论,提出了一种鲁棒性模型,用粒球替代样本点以提高高维数据特征选择的抗噪性和效率。
English Summary: This paper introduces a robust fuzzy rough set model by integrating multi-granularity granular-ball computing, which replaces sample points with granular-balls to enhance noise resistance and efficiency in feature selection for high-dimensional data.

Authors:Le Xia, Yao Sun, Haijian Sun, Rose Qingyang Hu, Dusit Niyato, Muhammad Ali Imran
Title: Joint Power and Spectrum Orchestration for D2D Semantic Communication Underlying Energy-Efficient Cellular Networks
Abstract:
Semantic communication (SemCom) has been recently deemed a promising next-generation wireless technique to enable efficient spectrum savings and information exchanges, thus naturally introducing a novel and practical network paradigm where cellular and device-to-device (D2D) SemCom approaches coexist. Nevertheless, the involved wireless resource management becomes complicated and challenging due to the unique semantic performance measurements and energy-consuming semantic coding mechanism. To this end, this paper jointly investigates power control and spectrum reuse problems for energy-efficient D2D SemCom cellular networks. Concretely, we first model the user preference-aware semantic triplet transmission and leverage a novel metric of semantic value to identify the semantic information importance conveyed in SemCom. Then, we define the additional power consumption from semantic encoding in conjunction with basic power amplifier dissipation to derive the overall system energy efficiency (semantic-value/Joule). Next, we formulate an energy efficiency maximization problem for joint power and spectrum allocation subject to several SemCom-related and practical constraints. Afterward, we propose an optimal resource management solution by employing the fractional-to-subtractive problem transformation and decomposition while developing a three-stage method with theoretical analysis of its optimality guarantee and computational complexity. Numerical results demonstrate the adequate performance superiority of our proposed solution compared with different benchmarks.
中文: 语义通信通过融合蜂窝与设备间通信提升无线效率,但需创新资源管理以优化能耗与性能,本文通过联合功率与频谱分配方案解决了这一问题。
English: Semantic communication enhances wireless efficiency by integrating cellular and device-to-device approaches, but requires innovative resource management to optimize energy use and performance, addressed here through a joint power and spectrum allocation solution.
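The "fractional-to-subtractive problem transformation" mentioned in the abstract can be illustrated with the classic Dinkelbach iteration, which is assumed here as a stand-in because the abstract does not give the exact formulation. The single-link model below (`CHANNEL_GAIN`, `CIRCUIT_POWER`, `P_MAX`, and the `value`/`power` functions) is entirely hypothetical and only mimics a value-per-Joule ratio; it is not the paper's system model or its three-stage method.

```python
import math


def dinkelbach(f, g, candidates, tol=1e-9, max_iter=100):
    """Maximize f(p) / g(p) over a finite candidate set, assuming g(p) > 0."""
    q = 0.0  # current estimate of the optimal ratio
    best = candidates[0]
    for _ in range(max_iter):
        # Subtractive subproblem: maximize f(p) - q * g(p) for the current q.
        best = max(candidates, key=lambda p: f(p) - q * g(p))
        gap = f(best) - q * g(best)
        q = f(best) / g(best)  # update the ratio estimate
        if abs(gap) < tol:  # a zero gap means q is the optimal ratio
            break
    return best, q


# Hypothetical single-link model (illustrative constants, not from the paper).
CHANNEL_GAIN = 4.0
CIRCUIT_POWER = 0.5
P_MAX = 2.0


def value(p):
    """Delivered value as a log-rate proxy (an assumption for this sketch)."""
    return math.log2(1.0 + CHANNEL_GAIN * p)


def power(p):
    """Total consumed power: transmit power plus a fixed overhead."""
    return p + CIRCUIT_POWER


grid = [P_MAX * i / 1000 for i in range(1001)]
p_star, ee_star = dinkelbach(value, power, grid)
print(f"power = {p_star:.3f}, efficiency = {ee_star:.4f}")
```

Each iteration replaces the ratio objective with a difference that is easier to maximize; on a finite grid the loop stops once the best difference is numerically zero.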

Authors:Georgios Andreadis, Eduard Ruiz Munné, Thomas H. W. Bäck, Peter A. N. Bosman, Tanja Alderliesten
Title: Multi-Objective Deep-Learning-based Biomechanical Deformable Image Registration with MOREA
Abstract:
When choosing a deformable image registration (DIR) approach for images with large deformations and content mismatch, the realism of found transformations often needs to be traded off against the required runtime. DIR approaches using deep learning (DL) techniques have shown remarkable promise in instantly predicting a transformation. However, on difficult registration problems, the realism of these transformations can fall short. DIR approaches using biomechanical, finite element modeling (FEM) techniques can find more realistic transformations, but tend to require much longer runtimes. This work proposes the first hybrid approach to combine them, with the aim of getting the best of both worlds. This hybrid approach, called DL-MOREA, combines a recently introduced multi-objective DL-based DIR approach which leverages the VoxelMorph framework, called DL-MODIR, with MOREA, an evolutionary algorithm-based, multi-objective DIR approach in which a FEM-like biomechanical mesh transformation model is used. In our proposed hybrid approach, the DL results are used to smartly initialize MOREA, with the aim of more efficiently optimizing its mesh transformation model. We empirically compare DL-MOREA against its components, DL-MODIR and MOREA, on CT scan pairs capturing large bladder filling differences of 15 cervical cancer patients. While MOREA requires a median runtime of 45 minutes, DL-MOREA can already find high-quality transformations after 5 minutes. Compared to the DL-MODIR transformations, the transformations found by DL-MOREA exhibit far less folding and improve or preserve the bladder contour distance error.
Chinese: 本研究提出DL-MOREA混合方法,通过深度学习快速初始化结合生物力学建模,在保持膀胱轮廓精度的同时,将配准时间从45分钟缩短至5分钟,有效平衡了变形图像配准的实时性与准确性需求。
English: This study introduces DL-MOREA, a hybrid approach that combines deep learning for rapid initialization with biomechanical modeling to achieve realistic deformable image registration, significantly reducing runtime while improving transformation quality compared to individual methods.

Authors:Emad Efatinasab, Alessandro Brighente, Denis Donadel, Mauro Conti, Mirco Rampazzo
Title: Towards Robust Stability Prediction in Smart Grids: GAN-based Approach under Data Constraints and Adversarial Challenges
Abstract:
Smart grids are crucial for meeting rising energy demands driven by global population growth and urbanization. By integrating renewable energy sources, they enhance efficiency, reliability, and sustainability. However, ensuring their availability and security requires advanced operational control and safety measures. Although artificial intelligence and machine learning can help assess grid stability, challenges such as data scarcity and cybersecurity threats, particularly adversarial attacks, remain. Data scarcity is a major issue, as obtaining real-world instances of grid instability requires significant expertise, resources, and time. Yet, these instances are critical for testing new research advancements and security mitigations. This paper introduces a novel framework for detecting instability in smart grids using only stable data. It employs a Generative Adversarial Network (GAN) where the generator is designed not to produce near-realistic data but instead to generate Out-Of-Distribution (OOD) samples with respect to the stable class. These OOD samples represent unstable behavior, anomalies, or disturbances that deviate from the stable data distribution. By training exclusively on stable data and exposing the discriminator to OOD samples, our framework learns a robust decision boundary to distinguish stable conditions from any unstable behavior, without requiring unstable data during training. Furthermore, we incorporate an adversarial training layer to enhance resilience against attacks. Evaluated on a real-world dataset, our solution achieves up to 98.1\% accuracy in predicting grid stability and 98.9\% in detecting adversarial attacks. Implemented on a single-board computer, it enables real-time decision-making with an average response time of under 7ms.
Chinese: 本文提出了一种仅使用稳定数据检测智能电网不稳定的新框架,通过生成对抗网络产生分布外样本来代表不稳定行为,在稳定预测和对抗攻击检测方面达到高精度,并实现实时决策。
English: This paper introduces a novel framework that uses only stable data to detect smart grid instability by employing a Generative Adversarial Network to generate Out-Of-Distribution samples, achieving high accuracy in stability prediction and adversarial attack detection with real-time performance.

Authors:Zijian Yang, Vahe Eminyan, Ralf Schlüter, Hermann Ney
Title: Classification Error Bound for Low Bayes Error Conditions in Machine Learning
Abstract:
In statistical classification and machine learning, classification error is an important performance measure, which is minimized by the Bayes decision rule. In practice, the unknown true distribution is usually replaced with a model distribution estimated from the training data in the Bayes decision rule. This substitution introduces a mismatch between the Bayes error and the model-based classification error. In this work, we apply classification error bounds to study the relationship between the error mismatch and the Kullback-Leibler divergence in machine learning. Motivated by recent observations of low model-based classification errors in many machine learning tasks, bounding the Bayes error to be lower, we propose a linear approximation of the classification error bound for low Bayes error conditions. Then, the bound for class priors is discussed. Moreover, we extend the classification error bound to sequences. Using automatic speech recognition as a representative example of machine learning applications, this work analytically discusses the correlations among different performance measures with extended bounds, including cross-entropy loss, language model perplexity, and word error rate.
中文: 本研究通过建立分类误差界限探讨机器学习中误差失配与K-L散度的关系,针对低贝叶斯误差条件提出线性近似方法,并将界限扩展至序列分析,同时以自动语音识别为例,解析了交叉熵损失与词错误率等性能指标间的关联性。
English: This study explores the relationship between error mismatch and Kullback-Leibler divergence in machine learning by developing classification error bounds, proposing a linear approximation for low Bayes error conditions, and extending these bounds to sequences while analyzing correlations with performance measures like cross-entropy loss and word error rate in applications such as automatic speech recognition.
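The quantities the abstract relates (Bayes error, model-based classification error, and Kullback-Leibler divergence) can be made concrete on a toy discrete problem. The sketch below uses made-up joint distributions `P_TRUE` and `Q_MODEL` over two observations and two classes; it only computes the three quantities from their standard definitions and does not reproduce the paper's bounds.

```python
import math


def bayes_error(p):
    """Error of the Bayes rule: pick the class with the largest true joint mass."""
    return 1.0 - sum(max(row) for row in p)


def model_based_error(p, q):
    """Error under the true joint p when decisions are taken with the model q."""
    err = 1.0
    for p_row, q_row in zip(p, q):
        decided = max(range(len(q_row)), key=q_row.__getitem__)
        err -= p_row[decided]
    return err


def kl_divergence(p, q):
    """KL(p || q) over the joint distribution, in nats."""
    return sum(
        pv * math.log(pv / qv)
        for p_row, q_row in zip(p, q)
        for pv, qv in zip(p_row, q_row)
        if pv > 0
    )


# Rows index observations x, columns index classes c (illustrative numbers only).
P_TRUE = [[0.4, 0.1], [0.2, 0.3]]
Q_MODEL = [[0.3, 0.2], [0.3, 0.2]]

e_bayes = bayes_error(P_TRUE)
e_model = model_based_error(P_TRUE, Q_MODEL)
mismatch = e_model - e_bayes
divergence = kl_divergence(P_TRUE, Q_MODEL)
print(e_bayes, e_model, mismatch, divergence)
```

Here the model decides class 0 for both observations, so its error exceeds the Bayes error; the mismatch is nonnegative by construction, and the divergence is positive because the two distributions differ.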

Authors:Shuo Shao, Haozhe Zhu, Yiming Li, Hongwei Yao, Tianwei Zhang, Zhan Qin
Title: FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint
Abstract:
Model fingerprinting is a widely adopted approach to safeguard the intellectual property rights of open-source models by preventing their unauthorized reuse. It is promising and convenient since it does not necessitate modifying the protected model. In this paper, we revisit existing fingerprinting methods and reveal that they are vulnerable to false claim attacks where adversaries falsely assert ownership of any third-party model. We demonstrate that this vulnerability mostly stems from their untargeted nature, where they generally compare the outputs of given samples on different models instead of the similarities to specific references. Motivated by these findings, we propose a targeted fingerprinting paradigm (i.e., FIT-Print) to counteract false claim attacks. Specifically, FIT-Print transforms the fingerprint into a targeted signature via optimization. Building on the principles of FIT-Print, we develop bit-wise and list-wise black-box model fingerprinting methods, i.e., FIT-ModelDiff and FIT-LIME, which exploit the distance between model outputs and the feature attribution of specific samples as the fingerprint, respectively. Extensive experiments on benchmark models and datasets verify the effectiveness, conferrability, and resistance to false claim attacks of our FIT-Print.
中文摘要:模型指纹识别虽能保护开源模型免受未经授权使用,但易受虚假声明攻击;为此提出的FIT-Print方法通过将指纹转化为目标签名,有效抵御此类攻击,并开发了FIT-ModelDiff和FIT-LIME两种具体实现验证其有效性。
English Summary: Model fingerprinting protects open-source models from unauthorized use but is vulnerable to false claims, prompting the development of FIT-Print, a targeted method that transforms fingerprints into signatures to counter these attacks effectively.

Authors:Yipeng Liu, Qi Yang, Yiling Xu
Title: Differentiable Low-computation Global Correlation Loss for Monotonicity Evaluation in Quality Assessment
Abstract:
In this paper, we propose a global monotonicity consistency training strategy for quality assessment, which includes a differentiable, low-computation monotonicity evaluation loss function and a global perception training mechanism. Specifically, unlike conventional ranking loss and linear programming approaches that indirectly implement the Spearman rank-order correlation coefficient (SROCC) function, our method directly converts SROCC into a loss function by making the sorting operation within SROCC differentiable and functional. Furthermore, to mitigate the discrepancies between batch optimization during network training and global evaluation of SROCC, we introduce a memory bank mechanism. This mechanism stores gradient-free predicted results from previous batches and uses them in the current batch's training to prevent abrupt gradient changes. We evaluate the performance of the proposed method on both images and point clouds quality assessment tasks, demonstrating performance gains in both cases.
Chinese: 本文提出了一种用于质量评估的全局单调一致性训练策略,通过可微单调性损失函数和记忆库机制,在图像和点云质量评估任务中均实现了性能提升。
English: This paper introduces a global monotonicity consistency training strategy for quality assessment, featuring a differentiable monotonicity loss and a memory bank mechanism to enhance performance in both image and point cloud evaluation tasks.
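One common way to make the sorting step inside SROCC differentiable is to replace hard ranks with sigmoid-based soft ranks; this relaxation is an assumption chosen for illustration, since the abstract does not specify the paper's construction, and the memory bank mechanism is not covered. The function names and the temperature `tau` are hypothetical. The sketch is plain Python, so it evaluates the loss only; in practice the same expressions would be written in an autodiff framework to obtain gradients.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_rank(values, tau=0.1):
    """Smooth ranks: approach the hard ranks 1..n as tau goes to 0."""
    # The j == i term contributes sigmoid(0) = 0.5, hence the 0.5 offset.
    return [0.5 + sum(_sigmoid((v - w) / tau) for w in values) for v in values]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def soft_srocc_loss(pred, target, tau=0.1):
    """1 - (Pearson correlation of soft ranks): near 0 when the order agrees."""
    return 1.0 - pearson(soft_rank(pred, tau), soft_rank(target, tau))


scores = [1.0, 10.0, 100.0, 1000.0]   # monotone but nonlinear predictions
labels = [10.0, 20.0, 30.0, 40.0]     # subjective quality labels
print(soft_srocc_loss(scores, labels, tau=0.01))
```

A larger `tau` gives smoother ranks at the cost of a looser approximation to the true rank correlation.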

Authors:Tianrui Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Haoyu Wang, Zikang Huang, Yu Jiang, Xiaobao Wang, Xie Chen, Longbiao Wang, Jianwu Dang
Title: Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models
Abstract:
Recently, emotional speech generation and speaker cloning have garnered significant interest in text-to-speech (TTS). With the open-sourcing of codec language TTS models trained on massive datasets with large-scale parameters, adapting these general pre-trained TTS models to generate speech with specific emotional expressions and target speaker characteristics has become a topic of great attention. Common approaches, such as full and adapter-based fine-tuning, often overlook the specific contributions of model parameters to emotion and speaker control. Treating all parameters uniformly during fine-tuning, especially when the target data has limited content diversity compared to the pre-training corpus, results in slow training speed and an increased risk of catastrophic forgetting. To address these challenges, we propose a characteristic-specific partial fine-tuning strategy, CSP-FT for short. First, we use a weighted-sum approach to analyze the contributions of different Transformer layers in a pre-trained codec language TTS model for emotion and speaker control in the generated speech. We then selectively fine-tune the layers with the highest and lowest characteristic-specific contributions to generate speech with target emotional expression and speaker identity. Experimental results demonstrate that our method achieves performance comparable to, or even surpassing, full fine-tuning in generating speech with specific emotional expressions and speaker identities. Additionally, CSP-FT delivers approximately 2x faster training speeds, fine-tunes only around 8% of parameters, and significantly reduces catastrophic forgetting. Furthermore, we show that codec language TTS models perform competitively with self-supervised models in speaker identification and emotion classification tasks, offering valuable insights for developing universal speech processing models.
中文摘要:针对语音合成中情感表达和说话人克隆的需求,现有微调方法存在训练缓慢和灾难性遗忘问题;我们提出的CSP-FT策略通过选择性微调关键模型层,在实现优异性能的同时显著提升训练效率并减少参数更新。
English Summary: Recent advances in text-to-speech (TTS) have enabled emotional speech generation and speaker cloning, but current fine-tuning methods often cause slow training and catastrophic forgetting; our proposed CSP-FT strategy selectively fine-tunes key model layers to achieve superior performance with faster training and reduced parameter updates.

Authors:Benjamin Hou, Tejas Sudharshan Mathai, Pritam Mukherjee, Xinya Wang, Ronald M. Summers, Zhiyong Lu
Title: Segment-and-Classify: ROI-Guided Generalizable Contrast Phase Classification in CT Using XGBoost
Abstract:
Purpose: To automate contrast phase classification in CT using organ-specific features extracted from a widely used segmentation tool with a lightweight decision tree classifier. Materials and Methods: This retrospective study utilized three public CT datasets from separate institutions. The phase prediction model was trained on the WAW-TACE (median age: 66 [60,73]; 185 males) dataset, and externally validated on the VinDr-Multiphase (146 males; 63 females; 56 unk) and C4KC-KiTS (median age: 61 [50,68]; 123 males) datasets. Contrast phase classification was performed using organ-specific features extracted by TotalSegmentator, followed by prediction using a gradient-boosted decision tree classifier. Results: On the VinDr-Multiphase dataset, the phase prediction model achieved the highest or comparable AUCs across all phases (>0.937), with superior F1-scores in the non-contrast (0.994), arterial (0.937), and delayed (0.718) phases. Statistical testing indicated significant performance differences only in the arterial and delayed phases (p<0.05). On the C4KC-KiTS dataset, the phase prediction model achieved the highest AUCs across all phases (>0.991), with superior F1-scores in arterial/venous (0.968) and delayed (0.935) phases. Statistical testing confirmed significant improvements over all baseline models in these two phases (p<0.05). Performance in the non-contrast class, however, was comparable across all models, with no statistically significant differences observed (p>0.05). Conclusion: The lightweight model demonstrated strong performance relative to all baseline models, and exhibited robust generalizability across datasets from different institutions.
中文: 本研究利用TotalSegmentator提取器官特异性特征,通过梯度提升决策树开发了自动化CT增强时相分类系统,在多个公共数据集上展现出优异性能及跨机构泛化能力。
English: This study developed an automated CT contrast phase classification system using organ-specific features from TotalSegmentator and a gradient-boosted decision tree, demonstrating strong performance and cross-institutional generalizability across multiple public datasets.

Authors:Xinya Wang, Tejas Sudharshan Mathai, Boah Kim, Ronald M. Summers
Title: Leveraging Multiphase CT for Quality Enhancement of Portal Venous CT: Utility for Pancreas Segmentation
Abstract:
Multiphase CT studies are routinely obtained in clinical practice for diagnosis and management of various diseases, such as cancer. However, the CT studies can be acquired with low radiation doses, different scanners, and are frequently affected by motion and metal artifacts. Prior approaches have targeted the quality improvement of one specific CT phase (e.g., non-contrast CT). In this work, we hypothesized that leveraging multiple CT phases for the quality enhancement of one phase may prove advantageous for downstream tasks, such as segmentation. A 3D progressive fusion and non-local (PFNL) network was developed. It was trained with three degraded (low-quality) phases (non-contrast, arterial, and portal venous) to enhance the quality of the portal venous phase. Then, the effect of scan quality enhancement was evaluated using a proxy task of pancreas segmentation, which is useful for tracking pancreatic cancer. The proposed approach improved the pancreas segmentation by 3% over the corresponding low-quality CT scan. To the best of our knowledge, we are the first to harness multiphase CT for scan quality enhancement and improved pancreas segmentation.
中文: 本研究开发了一种三维渐进融合非局部网络,通过利用多期CT扫描信息来提升门静脉期图像质量,使胰腺分割准确率提高了3%。
English: This study introduces a 3D progressive fusion and non-local network that enhances the quality of portal venous CT scans by leveraging information from multiple CT phases, resulting in a 3% improvement in pancreas segmentation accuracy.

Authors:Lei Lan, Tianjia Shao, Zixuan Lu, Yu Zhang, Chenfanfu Jiang, Yin Yang
Title: 3DGS$^2$: Near Second-order Converging 3D Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a mainstream solution for novel view synthesis and 3D reconstruction. By explicitly encoding a 3D scene using a collection of Gaussian kernels, 3DGS achieves high-quality rendering with superior efficiency. As a learning-based approach, 3DGS training has been handled with the standard stochastic gradient descent (SGD) method, which offers at most linear convergence. Consequently, training often requires tens of minutes, even with GPU acceleration. This paper introduces a (near) second-order convergent training algorithm for 3DGS, leveraging its unique properties. Our approach is inspired by two key observations. First, the attributes of a Gaussian kernel contribute independently to the image-space loss, which endorses isolated and local optimization algorithms. We exploit this by splitting the optimization at the level of individual kernel attributes, analytically constructing small-size Newton systems for each parameter group, and efficiently solving these systems on GPU threads. This achieves Newton-like convergence per training image without relying on the global Hessian. Second, kernels exhibit sparse and structured coupling across input images. This property allows us to effectively utilize spatial information to mitigate overshoot during stochastic training. Our method converges an order of magnitude faster than standard GPU-based 3DGS training, requiring over $10\times$ fewer iterations while maintaining or surpassing the quality of the SGD-based 3DGS reconstructions.
Chinese: 本文提出了一种针对3D高斯泼溅的近二阶收敛训练算法,相比基于SGD的标准方法实现了超过10倍的加速收敛,同时保持或提升了重建质量。
English: This paper introduces a near second-order convergent training algorithm for 3D Gaussian Splatting that achieves over 10× faster convergence than standard SGD-based methods while maintaining or improving reconstruction quality.

Authors:Yipeng Liu, Qi Yang, Yujie Zhang, Yiling Xu, Le Yang, Zhu Li
Title: From Images to Point Clouds: An Efficient Solution for Cross-media Blind Quality Assessment without Annotated Training
Abstract:
We present a novel quality assessment method which can predict the perceptual quality of point clouds from new scenes without available annotations by leveraging the rich prior knowledge in images, called the Distribution-Weighted Image-Transferred Point Cloud Quality Assessment (DWIT-PCQA). Recognizing the human visual system (HVS) as the decision-maker in quality assessment regardless of media types, we can emulate the evaluation criteria for human perception via neural networks and further transfer the capability of quality prediction from images to point clouds by leveraging the prior knowledge in the images. Specifically, domain adaptation (DA) can be leveraged to bridge the images and point clouds by aligning feature distributions of the two media in the same feature space. However, the different manifestations of distortions in images and point clouds make feature alignment a difficult task. To reduce the alignment difficulty and consider the different distortion distribution during alignment, we have derived formulas to decompose the optimization objective of the conventional DA into two suboptimization functions with distortion as a transition. Specifically, through network implementation, we propose the distortion-guided biased feature alignment which integrates existing/estimated distortion distribution into the adversarial DA framework, emphasizing common distortion patterns during feature alignment. Besides, we propose the quality-aware feature disentanglement to mitigate the destruction of the mapping from features to quality during alignment with biased distortions. Experimental results demonstrate that our proposed method exhibits reliable performance compared to general blind PCQA methods without needing point cloud annotations.
中文摘要:本文提出DWIT-PCQA方法,通过域适配将图像质量评估能力迁移至点云,采用失真引导的特征对齐和质量感知特征解耦技术,在无需点云标注的情况下实现了可靠的感知质量预测。
English Summary: The paper introduces DWIT-PCQA, a novel quality assessment method that transfers image-based quality prediction capabilities to point clouds through domain adaptation, using distortion-guided feature alignment and quality-aware feature disentanglement to achieve reliable performance without point cloud annotations.

Authors:Jacopo Nudo, Matteo Cinelli, Andrea Baronchelli, Walter Quattrociocchi
Title: From Niche to Mainstream: Community Size and Engagement in Social Media Conversations
Abstract:
The architecture of public discourse has been profoundly reshaped by social media platforms, which mediate interactions at an unprecedented scale and complexity. This study analyzes user behavior across six platforms over 33 years, exploring how the size of conversations and communities influences dialogue dynamics. Our findings reveal that smaller platforms foster richer, more sustained interactions, while larger platforms drive broader but shorter participation. Moreover, we observe that the propensity for users to re-engage in a conversation decreases as community size grows, with niche environments as a notable exception, where participation remains robust. These findings show an interdependence between platform architecture, user engagement, and community dynamics, shedding light on how digital ecosystems shape the structure and quality of public discourse.
中文: 社交媒体重塑公共话语结构,研究表明小型平台促进更深入持久的互动,而大型平台推动广泛但短暂的参与,且用户再参与度随社区规模扩大而下降,仅小众环境例外。
English: Social media platforms reshape the structure of public discourse; this study finds that smaller platforms foster richer, more sustained interactions, while larger ones drive broader but shorter participation, and that user re-engagement declines as community size grows, except in niche environments.

Authors:Jun Xu, Zhengxue Cheng, Guangchuan Chi, Yuhan Liu, Yuelin Hu, Li Song
Title: Rate-Aware Learned Speech Compression
Abstract:
The rapid rise of real-time communication and large language models has significantly increased the importance of speech compression. Deep learning-based neural speech codecs have outperformed traditional signal-level speech codecs in terms of rate-distortion (RD) performance. Typically, these neural codecs employ an encoder-quantizer-decoder architecture, where audio is first converted into latent code feature representations and then into discrete tokens. However, this architecture exhibits insufficient RD performance due to two main drawbacks: (1) the inadequate performance of the quantizer, challenging training processes, and issues such as codebook collapse; (2) the limited representational capacity of the encoder and decoder, making it difficult to meet feature representation requirements across various bitrates. In this paper, we propose a rate-aware learned speech compression scheme that replaces the quantizer with an advanced channel-wise entropy model to improve RD performance, simplify training, and avoid codebook collapse. We employ multi-scale convolution and linear attention mixture blocks to enhance the representational capacity and flexibility of the encoder and decoder. Experimental results demonstrate that the proposed method achieves state-of-the-art RD performance, obtaining a 53.51% BD-Rate bitrate saving on average, and achieves 0.26 BD-VisQol and 0.44 BD-PESQ gains.
中文: 该研究提出的速率感知学习语音压缩方案通过通道熵模型替代量化器,并采用多尺度卷积和线性注意力增强编码器-解码器能力,实现了最优的率失真性能,在比特率节省和音质提升方面取得显著成效。
English: The proposed rate-aware learned speech compression scheme replaces traditional quantizers with a channel-wise entropy model and enhances encoder-decoder capacity through multi-scale convolution and linear attention, achieving state-of-the-art rate-distortion performance with significant bitrate savings and quality gains.

Authors:Xiangyuan Peng, Huawei Sun, Kay Bierzynski, Anton Fischbacher, Lorenzo Servadei, Robert Wille
Title: MutualForce: Mutual-Aware Enhancement for 4D Radar-LiDAR 3D Object Detection
Abstract:
Radar and LiDAR have been widely used in autonomous driving as LiDAR provides rich structure information, and radar demonstrates high robustness under adverse weather. Recent studies highlight the effectiveness of fusing radar and LiDAR point clouds. However, challenges remain due to the modality misalignment and information loss during feature extractions. To address these issues, we propose a 4D radar-LiDAR framework to mutually enhance their representations. Initially, the indicative features from radar are utilized to guide both radar and LiDAR geometric feature learning. Subsequently, to mitigate their sparsity gap, the shape information from LiDAR is used to enrich radar BEV features. Extensive experiments on the View-of-Delft (VoD) dataset demonstrate our approach's superiority over existing methods, achieving the highest mAP of 71.76% across the entire area and 86.36% within the driving corridor. Especially for cars, we improve the AP by 4.17% and 4.20% due to the strong indicative features and symmetric shapes.
Chinese: 提出的4D雷达-激光雷达融合框架通过结合雷达的指示性特征与激光雷达的几何信息相互增强表征,在VoD数据集上以71.76% mAP实现了最优性能。
English: The proposed 4D radar-LiDAR fusion framework mutually enhances geometric features by leveraging radar's indicative signals and LiDAR's shape information, achieving state-of-the-art performance with 71.76% mAP on the VoD dataset.

Authors:Hyeon Jeon, Hyunwook Lee, Yun-Hsin Kuo, Taehyun Yang, Daniel Archambault, Sungahn Ko, Takanori Fujiwara, Kwan-Liu Ma, Jinwook Seo
Title: Unveiling High-dimensional Backstage: A Survey for Reliable Visual Analytics with Dimensionality Reduction
Abstract:
Dimensionality reduction (DR) techniques are essential for visually analyzing high-dimensional data. However, visual analytics using DR often face unreliability, stemming from factors such as inherent distortions in DR projections. This unreliability can lead to analytic insights that misrepresent the underlying data, potentially resulting in misguided decisions. To tackle these reliability challenges, we review 133 papers that address the unreliability of visual analytics using DR. Through this review, we contribute (1) a workflow model that describes the interaction between analysts and machines in visual analytics using DR, and (2) a taxonomy that identifies where and why reliability issues arise within the workflow, along with existing solutions for addressing them. Our review reveals ongoing challenges in the field, whose significance and urgency are validated by five expert researchers. This review also finds that the current research landscape is skewed toward developing new DR techniques rather than their interpretation or evaluation, where we discuss how the HCI community can contribute to broadening this focus.
中文: 本文通过综述133篇论文,针对降维技术用于可视化分析时的不可靠性提出了工作流模型和分类法,以识别并解决可靠性问题,同时指出当前研究过于偏重技术开发而忽视解释评估的现状。
English: This review of 133 papers addresses the unreliability in visual analytics using dimensionality reduction by proposing a workflow model and taxonomy to identify and solve reliability issues, highlighting the field's current overemphasis on technique development over interpretation.

Authors:Ancheng Xu, Di Yang, Renhao Li, Jingwei Zhu, Minghuan Tan, Min Yang, Wanxin Qiu, Mingchen Ma, Haihong Wu, Bingyu Li, Feng Sha, Chengming Li, Xiping Hu, Qiang Qu, Derek F. Wong, Ruifeng Xu
Title: AutoCBT: An Autonomous Multi-agent Framework for Cognitive Behavioral Therapy in Psychological Counseling
Abstract:
Traditional in-person psychological counseling remains primarily niche, often chosen by individuals with psychological issues, while online automated counseling offers a potential solution for those hesitant to seek help due to feelings of shame. Cognitive Behavioral Therapy (CBT) is an essential and widely used approach in psychological counseling. The advent of large language models (LLMs) and agent technology enables automatic CBT diagnosis and treatment. However, current LLM-based CBT systems use agents with a fixed structure, limiting their self-optimization capabilities, or providing hollow, unhelpful suggestions due to redundant response patterns. In this work, we utilize Quora-like and YiXinLi single-round consultation models to build a general agent framework that generates high-quality responses for single-turn psychological consultation scenarios. We use a bilingual dataset to evaluate the quality of single-response consultations generated by each framework. Then, we incorporate dynamic routing and supervisory mechanisms inspired by real psychological counseling to construct a CBT-oriented autonomous multi-agent framework, demonstrating its general applicability. Experimental results indicate that AutoCBT can provide higher-quality automated psychological counseling services.
中文: 传统心理咨询常因羞耻感而受限,而基于大语言模型的在线自动系统虽能提供认知行为疗法,但现有固定结构智能体易产生无效建议,因此开发了AutoCBT多智能体框架,通过动态路由和监督机制显著提升了自动化心理咨询服务质量。
English: Traditional psychological counseling is often limited by stigma, while online automated systems using large language models can offer accessible Cognitive Behavioral Therapy, though current fixed-structure agents produce unhelpful responses, leading to the development of AutoCBT, a multi-agent framework that enhances service quality through dynamic routing and supervision.

Authors:Seohyun Lee, Wenzhi Fang, Anindya Bijoy Das, Seyyedali Hosseinalipour, David J. Love, Christopher G. Brinton
Title: Cooperative Decentralized Backdoor Attacks on Vertical Federated Learning
Abstract:
Federated learning (FL) is vulnerable to backdoor attacks, where adversaries alter model behavior on target classification labels by embedding triggers into data samples. While these attacks have received considerable attention in horizontal FL, they are less understood for vertical FL (VFL), where devices hold different features of the samples, and only the server holds the labels. In this work, we propose a novel backdoor attack on VFL which (i) does not rely on gradient information from the server and (ii) considers potential collusion among multiple adversaries for sample selection and trigger embedding. Our label inference model augments variational autoencoders with metric learning, which adversaries can train locally. A consensus process over the adversary graph topology determines which datapoints to poison. We further propose methods for trigger splitting across the adversaries, with an intensity-based implantation scheme skewing the server towards the trigger. Our convergence analysis reveals the impact of backdoor perturbations on VFL indicated by a stationarity gap for the trained model, which we verify empirically as well. We conduct experiments comparing our attack with recent backdoor VFL approaches, finding that ours obtains significantly higher success rates for the same main task performance despite not using server information. Additionally, our results verify the impact of collusion on attack performance.
中文摘要:本研究提出了一种新型纵向联邦学习后门攻击方法,该方法无需服务器梯度信息,通过敌手协同投毒与触发模式分割,在保持模型性能的同时显著提升攻击成功率。
English Summary: This study introduces a novel backdoor attack on vertical federated learning that operates without server gradient information, utilizes collusion among adversaries for coordinated poisoning, and achieves higher attack success while maintaining model performance.

Authors:Jarett Dewbury, Chi-en Amy Tai, Alexander Wong
Title: Cancer-Net PCa-Seg: Benchmarking Deep Learning Models for Prostate Cancer Segmentation Using Synthetic Correlated Diffusion Imaging
Abstract:
Prostate cancer (PCa) is the most prevalent cancer among men in the United States, accounting for nearly 300,000 cases, 29\% of all diagnoses and 35,000 total deaths in 2024. Traditional screening methods such as prostate-specific antigen (PSA) testing and magnetic resonance imaging (MRI) have been pivotal in diagnosis, but have faced limitations in specificity and generalizability. In this paper, we explore the potential of enhancing PCa gland segmentation using a novel MRI modality called synthetic correlated diffusion imaging (CDI$^s$). We employ several state-of-the-art deep learning models, including U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet, to segment prostate glands from a 200 CDI$^s$ patient cohort. We find that SegResNet achieved superior segmentation performance with a Dice-Sorensen coefficient (DSC) of $76.68 \pm 0.8$. Notably, the Attention U-Net, while slightly less accurate (DSC $74.82 \pm 2.0$), offered a favorable balance between accuracy and computational efficiency. Our findings demonstrate the potential of deep learning models in improving prostate gland segmentation using CDI$^s$ to enhance PCa management and clinical support.
中文: 本研究证明,深度学习模型(尤其是SegResNet)通过合成相关扩散成像(CDI$^s$)显著提升了前列腺腺体分割效果,为前列腺癌诊疗提供了更精准高效的解决方案。
English: This study demonstrates that deep learning models, particularly SegResNet, significantly enhance prostate gland segmentation using synthetic correlated diffusion imaging (CDI$^s$), offering improved accuracy and efficiency for prostate cancer management.
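
The Dice-Sorensen coefficient used as the segmentation metric above can be illustrated with a minimal sketch. The function name and the flat 0/1 mask representation are assumptions made for illustration, not the paper's evaluation code; the reported DSC values appear to be on a 0-100 scale, whereas this sketch returns a value in [0, 1].

```python
def dice_coefficient(pred, target):
    """Dice-Sorensen coefficient between two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (an illustrative representation; real masks are usually 2D/3D arrays).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # convention: two empty masks are treated as a perfect match
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(dice_coefficient(predicted, reference))
```

Multiplying the result by 100 gives a percentage-style score comparable in scale to the figures quoted in the abstract.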

Authors:Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
Title: RepVideo: Rethinking Cross-Layer Representation for Video Generation
Abstract:
Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
中文摘要:本文提出RepVideo框架,通过聚合相邻层特征来增强文本到视频扩散模型的语义表示稳定性,显著提升了生成视频的空间准确性(如复杂物体关系)和时间连贯性。
English Summary: This paper introduces RepVideo, a representation framework that enhances text-to-video diffusion models by aggregating features from adjacent layers to achieve more stable semantic representations and improve both spatial accuracy and temporal coherence in generated videos.

Authors:Joonho Ko, Jinheon Baek, Sung Ju Hwang
Title: Efficient Real-time Refinement of Language Model Text Generation
Abstract:
Large language models (LLMs) have shown remarkable performance across a wide range of natural language tasks. However, a critical challenge remains in that they sometimes generate factually incorrect answers. To address this, much previous work has focused on identifying errors in their generation and further refining them, but these methods are slow in deployment since they are designed to verify the response from LLMs only after their entire generation (from the first to the last token) is done. Further, we observe that once LLMs generate incorrect tokens early on, there is a higher likelihood that subsequent tokens will also be factually incorrect. To this end, in this work, we propose Streaming-VR (Streaming Verification and Refinement), a novel approach designed to enhance the efficiency of verification and refinement of LLM outputs. Specifically, the proposed Streaming-VR enables on-the-fly verification and correction of tokens as they are being generated, similar to a streaming process, ensuring that each subset of tokens is checked and refined in real-time by another LLM as the LLM constructs its response. Through comprehensive evaluations on multiple datasets, we demonstrate that our approach not only enhances the factual accuracy of LLMs, but also offers a more efficient solution compared to prior refinement methods.
中文: 本文提出Streaming-VR方法,通过在大型语言模型生成过程中实时验证和修正令牌,相比现有方法能更高效地提升输出内容的事实准确性。
English: This paper introduces Streaming-VR, a novel method that enables real-time verification and refinement of tokens during LLM generation to improve factual accuracy and efficiency over previous approaches.

Authors:Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
Title: Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Abstract:
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
中文: 本研究提出了TA-TiTok图像分词器,通过在解码阶段融合文本信息提升性能与扩展性,并基于此开发了仅使用开放数据即可达到媲美私有数据训练效果的MaskGen生成模型。
English: The study introduces TA-TiTok, an efficient image tokenizer that integrates text during decoding to improve performance and scalability, and MaskGen models that achieve competitive results using only open data.

Authors:Niccolo' Di Marco, Edoardo Loru, Alessandro Galeazzi, Matteo Cinelli, Walter Quattrociocchi
Title: Decoding Musical Evolution Through Network Science
Abstract:
Music has always been central to human culture, reflecting and shaping traditions, emotions, and societal changes. Technological advancements have transformed how music is created and consumed, influencing tastes and the music itself. In this study, we use Network Science to analyze musical complexity. Drawing on $\approx20,000$ MIDI files across six macro-genres spanning nearly four centuries, we represent each composition as a weighted directed network to study its structural properties. Our results show that Classical and Jazz compositions have higher complexity and melodic diversity than recently developed genres. However, a temporal analysis reveals a trend toward simplification, with even Classical and Jazz nearing the complexity levels of modern genres. This study highlights how digital tools and streaming platforms shape musical evolution, fostering new genres while driving homogenization and simplicity.
中文摘要:本研究运用网络科学分析六个音乐流派的复杂性,发现古典乐和爵士乐具有更高的复杂性和旋律多样性,但受数字工具和流媒体平台影响,所有流派均呈现随时间推移而简化的趋势。
English Summary: This study applies Network Science to analyze musical complexity across six genres, revealing that Classical and Jazz exhibit higher complexity and melodic diversity, yet a trend toward simplification is observed over time, influenced by digital tools and streaming platforms.
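
One way to picture "representing a composition as a weighted directed network" is the sketch below. It assumes that nodes are pitch names and that an edge weight counts how often one note is immediately followed by another; the abstract does not specify the paper's actual construction, so both choices and the function name are hypothetical.

```python
from collections import Counter


def transition_network(notes):
    """Build a weighted directed network from a note sequence.

    Returns a dict mapping (source, target) edges to the number of times
    `source` is immediately followed by `target` in the sequence.
    This edge definition is an illustrative assumption.
    """
    # zip(notes, notes[1:]) yields each pair of consecutive notes.
    return dict(Counter(zip(notes, notes[1:])))


if __name__ == "__main__":
    melody = ["C", "E", "G", "C", "E"]
    for (src, dst), weight in sorted(transition_network(melody).items()):
        print(f"{src} -> {dst}: {weight}")
```

Structural properties such as the number of distinct edges or the spread of edge weights could then serve as rough proxies for melodic diversity, though the measures used in the study itself are not given in the abstract.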

Authors:Edoardo Di Martino, Matteo Cinelli, Roy Cerqueti, Walter Quattrociocchi
Title: Quantifying Polarization: A Comparative Study of Measures and Methods
Abstract:
Political polarization, a key driver of social fragmentation, has drawn increasing attention for its role in shaping online and offline discourse. Despite significant efforts, accurately measuring polarization within ideological distributions remains a challenge. This study evaluates five widely used polarization measures, testing their strengths and weaknesses with synthetic datasets and a real-world case study on YouTube discussions during the 2020 U.S. Presidential Election. Building on these findings, we present a novel adaptation of Kleinberg's burst detection algorithm to improve mode detection in polarized distributions. By offering both a critical review and an innovative methodological tool, this work advances the analysis of ideological patterns in social media discourse.
中文摘要:本研究评估了五种极化测量方法,并提出了改进的突发检测算法以提升意识形态分布中的模态识别能力,从而推动社交媒体话语分析的发展。
English Summary: This study critically assesses five polarization measures and introduces an enhanced burst detection algorithm to improve mode identification in ideological distributions, advancing the analysis of social media discourse.

Authors:Jie Yang, Ehsan Latif, Yuze He, Xiaoming Zhai
Title: Fine-tuning ChatGPT for Automatic Scoring of Written Scientific Explanations in Chinese
Abstract:
The development of explanations for scientific phenomena is essential in science assessment, but scoring student-written explanations remains challenging and resource-intensive. Large language models (LLMs) have shown promise in addressing this issue, particularly in alphabetic languages like English. However, their applicability to logographic languages is less explored. This study investigates the potential of fine-tuning ChatGPT, a leading LLM, to automatically score scientific explanations written in Chinese. Student responses to seven scientific explanation tasks were collected and automatically scored, with scoring accuracy examined in relation to reasoning complexity using the Kendall correlation. A qualitative analysis explored how linguistic features influenced scoring accuracy. The results show that domain-specific adaptation enables ChatGPT to score Chinese scientific explanations with accuracy. However, scoring accuracy correlates with reasoning complexity: a negative correlation for lower-level responses and a positive one for higher-level responses. The model overrates complex reasoning in low-level responses with intricate sentence structures and underrates high-level responses using concise causal reasoning. These correlations stem from linguistic features--simplicity and clarity enhance accuracy for lower-level responses, while comprehensiveness improves accuracy for higher-level ones. Simpler, shorter responses tend to score more accurately at lower levels, whereas longer, information-rich responses yield better accuracy at higher levels. These findings demonstrate the effectiveness of LLMs in automatic scoring within a Chinese context and emphasize the importance of linguistic features and reasoning complexity in fine-tuning scoring models for educational assessments.
中文摘要:通过微调ChatGPT可实现对中文科学解释的自动评分,但评分准确性受推理复杂性和语言特征影响,简单低阶回答和详尽高阶回答的评分更准确。
English Summary: Fine-tuning ChatGPT enables accurate automatic scoring of Chinese scientific explanations, though accuracy varies with reasoning complexity and linguistic features, showing higher precision for simpler low-level responses and more comprehensive high-level ones.

Authors:Ruizhong Qiu, Jun-Gi Jang, Xiao Lin, Lihui Liu, Hanghang Tong
Title: TUCKET: A Tensor Time Series Data Structure for Efficient and Accurate Factor Analysis over Time Ranges
Abstract:
Tucker decomposition has been widely used in a variety of applications to obtain latent factors of tensor data. In these applications, a common need is to compute Tucker decomposition for a given time range. Furthermore, real-world tensor time series are typically evolving in the time dimension. Such needs call for a data structure that can efficiently and accurately support range queries of Tucker decomposition and stream updates. Unfortunately, existing methods do not support either range queries or stream updates. This challenging problem has remained open for years prior to our work. To solve this challenging problem, we propose TUCKET, a data structure that can efficiently and accurately handle both range queries and stream updates. Our key idea is to design a new data structure that we call a stream segment tree by generalizing the segment tree, a data structure that was originally invented for computational geometry. For a range query of length $L$, our TUCKET can find $O(\log L)$ nodes (called the hit set) from the tree and efficiently stitch their preprocessed decompositions to answer the range query. We also propose an algorithm to optimally prune the hit set via an approximation of subtensor decomposition. For the $T$-th stream update, our TUCKET modifies only amortized $O(1)$ nodes and only $O(\log T)$ nodes in the worst case. Extensive evaluation demonstrates that our TUCKET consistently achieves the highest efficiency and accuracy across four large-scale datasets. Our TUCKET achieves at least 3 times lower latency and at least 1.4 times smaller reconstruction error than Zoom-Tucker on all datasets.
Chinese: TUCKET是一种新颖的数据结构,通过推广线段树来高效支持Tucker分解的范围查询和流式更新,在速度和精度上均优于现有方法。
English: TUCKET is a novel data structure that efficiently supports both range queries and stream updates for Tucker decomposition by generalizing the segment tree, achieving superior speed and accuracy over existing methods.
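
The idea of answering a range query from a logarithmic number of tree nodes can be sketched with a classic segment tree. This is the textbook structure that TUCKET generalizes, not the stream segment tree itself; the function name and the half-open interval convention are assumptions for illustration, and the stitching of per-node Tucker decompositions is omitted.

```python
def hit_set(node_lo, node_hi, query_lo, query_hi):
    """Return the canonical segment-tree nodes covering [query_lo, query_hi).

    Each node is a half-open interval (lo, hi) over integer time steps;
    the root call passes the full range, e.g. hit_set(0, n, a, b).
    """
    # No overlap between this node and the query: contribute nothing.
    if query_hi <= node_lo or node_hi <= query_lo:
        return []
    # Node lies entirely inside the query: use it as a whole.
    if query_lo <= node_lo and node_hi <= query_hi:
        return [(node_lo, node_hi)]
    # Partial overlap: recurse into the two children.
    mid = (node_lo + node_hi) // 2
    return (hit_set(node_lo, mid, query_lo, query_hi)
            + hit_set(mid, node_hi, query_lo, query_hi))


if __name__ == "__main__":
    # Query time steps 1..6 in a tree over 8 time steps.
    print(hit_set(0, 8, 1, 7))
```

In a classic segment tree the returned list has O(log n) entries; in the setting described above, each such node would hold a preprocessed decomposition of its time range.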

Authors:Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp
Title: FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
Abstract:
Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach in aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model, and focus on training it to correct misranked preference pairs. However, recent work~\citep{chen2024preference} empirically finds that DPO training \textit{rarely improves these misranked preference pairs}, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead \textit{down-weighs} misranked preference pairs and prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss. Our experiments demonstrate that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B, with the introduced hyperparameter fixed. Additionally, we empirically reveal how FocalPO affects training on correct and incorrect sample groups, further underscoring its effectiveness.
Chinese: FocalPO作为直接偏好优化(DPO)的新变体,通过优先处理正确排序的偏好对而非错误排序的,在Alpaca Eval 2.0等基准测试中超越了DPO,提升了与人类偏好的对齐效果。
English: FocalPO, a novel variant of Direct Preference Optimization (DPO), enhances alignment with human preferences by prioritizing correctly ranked pairs over misranked ones, outperforming DPO on benchmarks like Alpaca Eval 2.0.
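
A minimal sketch of how a modulating factor can scale the DPO loss is given below. The DPO loss itself follows its standard definition; the specific factor used here, the implied probability of a correct ranking raised to a power `gamma`, is an assumption chosen only to show down-weighting of misranked pairs and should not be read as the exact FocalPO objective. All function and parameter names are hypothetical.

```python
import math


def _log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Scaled implicit-reward margin from sequence log-probabilities."""
    return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))


def dpo_loss(margin):
    """Standard DPO loss for one preference pair: -log sigmoid(margin)."""
    return -_log_sigmoid(margin)


def focal_style_loss(margin, gamma=1.0):
    """DPO loss scaled by p**gamma, where p = sigmoid(margin).

    Assumed illustrative modulating factor: pairs the model misranks
    (p < 0.5) receive a smaller weight than correctly ranked pairs.
    """
    p_correct = math.exp(_log_sigmoid(margin))
    return (p_correct ** gamma) * dpo_loss(margin)


if __name__ == "__main__":
    misranked = dpo_margin(-5.0, -1.0, -3.0, -3.0)  # negative margin
    correct = dpo_margin(-1.0, -5.0, -3.0, -3.0)    # positive margin
    print(focal_style_loss(misranked, gamma=2.0), focal_style_loss(correct, gamma=2.0))
```

Setting `gamma=0` recovers the plain DPO loss, which makes the effect of the extra factor easy to isolate.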

Authors:Zhongzhen Huang, Gui Geng, Shengyi Hua, Zhen Huang, Haoyang Zou, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
Title: O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
Abstract:
Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our model adhere to the principles of the hypothetico-deductive method, producing a list of potential conditions that may explain a patient's symptoms and systematically narrowing these possibilities by evaluating the evidence. These findings demonstrate the promising synergy between inference-time scaling and journey learning in advancing LLMs' real-world clinical reasoning capabilities.
中文摘要:本研究表明,在大语言模型中增加推理时间能显著提升医学推理性能,其推理链长度与任务复杂度相关,且生成的鉴别诊断符合临床假说演绎法的规范流程。
English Summary: This study demonstrates that increasing inference time in large language models significantly enhances medical reasoning performance, with complexity-dependent reasoning chains and clinically valid diagnostic outputs aligning with hypothetico-deductive principles.

Authors:Yuhong Zhang, Jing Lin, Ailing Zeng, Guanlin Wu, Shunlin Lu, Yurong Fu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, Lei Zhang
Title: Motion-X++: A Large-Scale Multimodal 3D Whole-body Human Motion Dataset
Abstract:
In this paper, we introduce Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset. Existing motion datasets predominantly capture body-only poses, lacking facial expressions, hand gestures, and fine-grained pose descriptions, and are typically limited to lab settings with manually labeled text descriptions, thereby restricting their scalability. To address this issue, we develop a scalable annotation pipeline that can automatically capture 3D whole-body human motion and comprehensive textual labels from RGB videos and build the Motion-X dataset comprising 81.1K text-motion pairs. Furthermore, we extend Motion-X into Motion-X++ by improving the annotation pipeline, introducing more data modalities, and scaling up the data quantities. Motion-X++ provides 19.5M 3D whole-body pose annotations covering 120.5K motion sequences from massive scenes, 80.8K RGB videos, 45.3K audios, 19.5M frame-level whole-body pose descriptions, and 120.5K sequence-level semantic labels. Comprehensive experiments validate the accuracy of our annotation pipeline and highlight Motion-X++'s significant benefits for generating expressive, precise, and natural motion with paired multimodal labels supporting several downstream tasks, including text-driven whole-body motion generation, audio-driven motion generation, 3D whole-body human mesh recovery, and 2D whole-body keypoint estimation.
中文: 本文提出Motion-X++大规模多模态3D全身人体运动数据集,通过自动化标注流程克服了现有数据集仅含身体动作、缺乏面部表情与手势的局限,提供涵盖多场景的海量运动序列与多模态标签,在多个人体运动任务中展现出显著优势。
English: This paper presents Motion-X++, a large-scale multimodal 3D expressive whole-body human motion dataset that overcomes the limitations of existing datasets by providing comprehensive body, facial, and hand motion annotations with rich textual labels through an automated pipeline, validated across multiple downstream tasks.

Authors:Violet Xiang, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, Rafael Rafailov, Nathan Lile, Dakota Mahan, Louis Castricato, Jan-Philipp Franken, Nick Haber, Chelsea Finn
Title: Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Abstract:
We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.
中文:Meta-CoT框架通过建模潜在推理过程扩展了传统思维链,结合过程监督和合成数据生成等方法,并采用指令调优和强化学习的训练流程,为提升人工智能的推理能力提供了理论与实践路径。
English: The Meta Chain-of-Thought (Meta-CoT) framework enhances traditional CoT by modeling underlying reasoning processes, supported by empirical evidence and methods like process supervision and synthetic data generation, with a training pipeline involving instruction tuning and reinforcement learning to advance AI reasoning capabilities.

Authors:Tina Raissi, Ralf Schlüter, Hermann Ney
Title: Right Label Context in End-to-End Training of Time-Synchronous ASR Models
Abstract:
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained by using sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM) with its inherent generative formulation enables conditioning on the right label context. However, due to the HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we also show that it is possible to build a factored hybrid HMM system by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
中文: 本研究提出了一种包含左右标签上下文的分解损失函数,解决了ASR训练中的归一化问题,在有限数据条件下效果显著,并可构建基于全和准则的混合HMM系统。
English: This study introduces a factored loss function incorporating auxiliary left and right label contexts to address normalization issues in ASR training, demonstrating particular effectiveness with limited data and enabling full-sum criterion hybrid HMM systems.

Authors:Yuxiao Hu, Qian Li, Dongxiao Zhang, Jinyue Yan, Yuntian Chen
Title: Context-Alignment: Activating and Enhancing LLM Capabilities in Time Series
Abstract:
Recently, leveraging pre-trained Large Language Models (LLMs) for time series (TS) tasks has gained increasing attention, which involves activating and enhancing LLMs' capabilities. Many methods aim to activate LLMs' capabilities based on token-level alignment but overlook LLMs' inherent strength in natural language processing: their deep understanding of linguistic logic and structure rather than superficial embedding processing. We propose Context-Alignment, a new paradigm that aligns TS with a linguistic component in the language environments familiar to LLMs to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities. Specifically, such context-level alignment comprises structural alignment and logical alignment, which is achieved by Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to TS-language multimodal inputs. Structural alignment utilizes dual-scale nodes to describe hierarchical structure in TS-language, enabling LLMs to treat long TS data as a whole linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships, ensuring coherence in the contextual semantics. Demonstration example prompts are employed to construct Demonstration Examples based Context-Alignment (DECA) following the DSCA-GNNs framework. DECA can be flexibly and repeatedly integrated into various layers of pre-trained LLMs to improve awareness of logic and structure, thereby enhancing performance. Extensive experiments show the effectiveness of DECA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment provides powerful prior knowledge on context.
中文摘要:本文提出Context-Alignment新范式,通过将时间序列数据与语言环境对齐来激活大语言模型能力,采用双尺度上下文对齐图神经网络实现结构和逻辑对齐,在少样本预测等任务中显著提升模型性能。
English Summary: This paper introduces Context-Alignment, a paradigm that aligns time series data with linguistic structures to enhance LLMs' comprehension, using a Dual-Scale Context-Alignment GNNs framework for structural and logical alignment to improve performance in tasks like few-shot forecasting.

Authors:Tianyang Duan, Zongyuan Zhang, Zheng Lin, Yue Gao, Ling Xiong, Yong Cui, Hongbin Liang, Xianhao Chen, Heming Cui, Dong Huang
Title: Rethinking Adversarial Attacks in Reinforcement Learning from Policy Distribution Perspective
Abstract:
Deep Reinforcement Learning (DRL) suffers from uncertainties and inaccuracies in the observation signal in real-world applications. Adversarial attacks are an effective method for evaluating the robustness of DRL agents. However, existing attack methods targeting individual sampled actions have limited impacts on the overall policy distribution, particularly in continuous action spaces. To address these limitations, we propose the Distribution-Aware Projected Gradient Descent attack (DAPGD). DAPGD uses distribution similarity as the gradient perturbation input to attack the policy network, which leverages the entire policy distribution rather than relying on individual samples. We utilize the Bhattacharyya distance in DAPGD to measure policy similarity, enabling sensitive detection of subtle but critical differences between probability distributions. Our experimental results demonstrate that DAPGD achieves SOTA results compared to the baselines in three robot navigation tasks, with an average 22.03% higher reward drop than the best baseline.
Chinese: 提出的分布感知投影梯度下降(DAPGD)攻击通过利用巴氏距离针对整体策略分布进行对抗性评估,在机器人导航任务中相比现有方法实现了22.03%更高的奖励降低,达到了最先进的攻击效果。
English: The proposed Distribution-Aware Projected Gradient Descent (DAPGD) attack enhances adversarial evaluation of Deep Reinforcement Learning by targeting the entire policy distribution using Bhattacharyya distance, achieving state-of-the-art effectiveness with a 22.03% higher reward reduction in robot navigation tasks compared to existing methods.
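
The Bhattacharyya distance mentioned above has a closed form for Gaussian distributions, sketched here for policies with independent (diagonal) Gaussian action dimensions. The Gaussian policy form and the function names are assumptions for illustration; the sketch covers only the distance computation, not the DAPGD attack itself.

```python
import math


def bhattacharyya_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    # Term driven by the difference of the means.
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    # Term driven by the mismatch of the standard deviations.
    spread_term = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return mean_term + spread_term


def bhattacharyya_diag_gaussian(mus1, sigmas1, mus2, sigmas2):
    """Distance between two diagonal Gaussians: per-dimension distances add up."""
    return sum(
        bhattacharyya_gaussian_1d(m1, s1, m2, s2)
        for m1, s1, m2, s2 in zip(mus1, sigmas1, mus2, sigmas2)
    )


if __name__ == "__main__":
    # Clean vs. perturbed action distributions for a 2D continuous action.
    clean = ([0.0, 1.0], [1.0, 0.5])
    perturbed = ([2.0, 1.0], [1.0, 0.5])
    print(bhattacharyya_diag_gaussian(clean[0], clean[1], perturbed[0], perturbed[1]))
```

The distance is zero for identical distributions and grows with both mean shifts and variance mismatches, which is why it responds to changes in the whole distribution rather than in a single sampled action.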

Authors:Haoxuan Yuan, Zhe Chen, Zheng Lin, Jinbo Peng, Yuhang Zhong, Xuanjie Hu, Songyan Xue, Wei Li, Yue Gao
Title: Constructing 4D Radio Map in LEO Satellite Networks with Limited Samples
Abstract:
Recently, Low Earth Orbit (LEO) satellite networks (i.e., non-terrestrial networks (NTN)), such as Starlink, have been successfully deployed to provide broader coverage than terrestrial networks (TN). Due to limited spectrum resources, TN and NTN may soon share the same spectrum. Therefore, fine-grained spectrum monitoring is crucial for spectrum sharing and interference avoidance. To this end, constructing a 4D radio map (RM) including three spatial dimensions and signal spectra is important. However, this requires large-scale deployment of sensors and high-speed analog-to-digital converters for extensive spatial signal collection and wide power spectrum acquisition, respectively. To address these challenges, we propose DeepRM, a deep unsupervised learning framework that requires no ground-truth labels and comprises neural compressive sensing (CS) and tensor decomposition (TD) algorithms. Firstly, we map the CS process into the optimization of a neural network-associated loss function, and design a sparsity-performance balance training algorithm to reconstruct a wide power spectrum under limited sub-Nyquist samples. Secondly, according to the output of the neural CS algorithm, we also utilize neural networks to perform TD, and construct the 3D RM for each frequency, even under very sparse sensor deployment. Extensive evaluations show that DeepRM achieves lower error than its corresponding state-of-the-art baselines, especially with limited samples.
中文:低地球轨道卫星网络正扩大覆盖范围,但面临与地面网络共享频谱的挑战,为此开发了DeepRM这一深度无监督学习框架,通过神经压缩感知和张量分解算法高效构建四维无线电地图。
English: Low Earth Orbit satellite networks are expanding coverage but face spectrum sharing challenges with terrestrial networks, leading to the development of DeepRM, a deep unsupervised learning framework that constructs 4D radio maps efficiently using neural compressive sensing and tensor decomposition algorithms.

Authors:Deguo Xia, Weiming Zhang, Xiyan Liu, Wei Zhang, Chenting Gong, Xiao Tan, Jizhou Huang, Mengmeng Yang, Diange Yang
Title: LDMapNet-U: An End-to-End System for City-Scale Lane-Level Map Updating
Abstract:
An up-to-date city-scale lane-level map is an indispensable infrastructure and a key enabling technology for ensuring the safety and user experience of autonomous driving systems. In industrial scenarios, reliance on manual annotation for map updates creates a critical bottleneck. Lane-level updates require precise change information and must ensure consistency with adjacent data while adhering to strict standards. Traditional methods utilize a three-stage approach (construction, change detection, and updating), which often necessitates manual verification due to accuracy limitations. This results in labor-intensive processes and hampers timely updates. To address these challenges, we propose LDMapNet-U, which implements a new end-to-end paradigm for city-scale lane-level map updating. By reconceptualizing the update task as an end-to-end map generation process grounded in historical map data, we introduce a paradigm shift in map updating that simultaneously generates vectorized maps and change information. To achieve this, a Prior-Map Encoding (PME) module is introduced to effectively encode historical maps, serving as a critical reference for detecting changes. Additionally, we incorporate a novel Instance Change Prediction (ICP) module that learns to predict associations with historical maps. Consequently, LDMapNet-U simultaneously achieves vectorized map element generation and change detection. To demonstrate the superiority and effectiveness of LDMapNet-U, extensive experiments are conducted using large-scale real-world datasets. In addition, LDMapNet-U has been successfully deployed in production at Baidu Maps since April 2024, supporting map updating for over 360 cities and significantly shortening the update cycle from quarterly to weekly. The updated maps serve hundreds of millions of users and are integrated into the autonomous driving systems of several leading vehicle companies.
中文: LDMapNet-U提出了一种端到端的城市级车道级地图更新范式,通过整合历史地图编码与实例变化预测,实现了矢量化地图生成与变更检测的同步完成,大幅缩短了更新周期并支持规模化应用。
English: LDMapNet-U introduces an end-to-end paradigm for automated city-scale lane-level map updating, simultaneously generating vectorized maps and detecting changes to overcome manual annotation bottlenecks and significantly accelerate update cycles.

Authors:Zhongjian Cui, Chenrui Cui, Tianrui Wang, Mengnan He, Hao Shi, Meng Ge, Caixia Gong, Longbiao Wang, Jianwu Dang
Title: Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module
Abstract:
The information loss or distortion caused by single-channel speech enhancement (SE) harms the performance of automatic speech recognition (ASR). Observation addition (OA) is an effective post-processing method to improve ASR performance by balancing noisy and enhanced speech. Determining the OA coefficient is crucial. However, the current supervised OA coefficient module, called the bridging module, only utilizes simulated noisy speech for training, which has a severe mismatch with real noisy speech. In this paper, we propose training strategies to train the bridging module with real noisy speech. First, DNSMOS is selected to evaluate the perceptual quality of real noisy speech, so that the bridging module can be trained without the corresponding clean labels. Additional constraints during training are introduced to further enhance the robustness of the bridging module. Each utterance is evaluated by the ASR back-end using various OA coefficients to obtain the word error rates (WERs). The WERs are used to construct a multidimensional vector. This vector is introduced into the bridging module with multi-task learning and is used to determine the optimal OA coefficients. The experimental results on the CHiME-4 dataset show that all the proposed methods significantly improve on the bridging module trained on simulated data, especially on the real evaluation sets.
中文摘要:单通道语音增强可能损害自动语音识别性能,但通过采用DNSMOS评估真实噪声语音并结合多任务学习训练桥接模块,能显著提升识别效果,尤其在真实环境中表现更优。
English Summary: Single-channel speech enhancement can degrade automatic speech recognition, but training the bridging module with real noisy speech using DNSMOS and multi-task learning significantly improves performance, especially in real-world scenarios.
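
Observation addition, as described in the abstract above, blends the noisy and enhanced signals with a single coefficient, and the abstract evaluates several coefficients per utterance by WER. The sketch below illustrates both ideas under stated assumptions: the convention that the coefficient weights the noisy signal, and both function names, are hypothetical and not taken from the paper.

```python
import numpy as np


def observation_addition(noisy, enhanced, alpha):
    """Blend noisy and enhanced waveforms with OA coefficient alpha.

    Assumed convention: alpha weights the noisy signal and (1 - alpha)
    weights the enhanced signal. The paper may define it the other way round.
    """
    noisy = np.asarray(noisy, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    if noisy.shape != enhanced.shape:
        raise ValueError("noisy and enhanced signals must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * noisy + (1.0 - alpha) * enhanced


def best_coefficient(wer_by_alpha):
    """Return the candidate coefficient with the lowest word error rate.

    wer_by_alpha maps each tried coefficient to the WER an ASR back-end
    produced for it (the WER values are supplied by the caller).
    """
    return min(wer_by_alpha, key=wer_by_alpha.get)
```

For example, `best_coefficient({0.0: 0.2, 0.5: 0.1, 1.0: 0.3})` selects 0.5, the coefficient with the lowest WER among the candidates.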

Authors:Yuxin Zhang, Haoyu Chen, Zheng Lin, Zhe Chen, Jin Zhao
Title: LCFed: An Efficient Clustered Federated Learning Framework for Heterogeneous Data
Abstract:
Clustered federated learning (CFL) addresses the performance challenges posed by data heterogeneity in federated learning (FL) by organizing edge devices with similar data distributions into clusters, enabling collaborative model training tailored to each group. However, existing CFL approaches strictly limit knowledge sharing to within clusters, lacking the integration of global knowledge with intra-cluster training, which leads to suboptimal performance. Moreover, traditional clustering methods incur significant computational overhead, especially as the number of edge devices increases. In this paper, we propose LCFed, an efficient CFL framework to combat these challenges. By leveraging model partitioning and adopting distinct aggregation strategies for each sub-model, LCFed effectively incorporates global knowledge into intra-cluster co-training, achieving optimal training performance. Additionally, LCFed customizes a computationally efficient model similarity measurement method based on low-rank models, enabling real-time cluster updates with minimal computational overhead. Extensive experiments show that LCFed outperforms state-of-the-art benchmarks in both test accuracy and clustering computational efficiency.
Chinese: LCFed是一种高效的集群联邦学习框架,通过将全局知识融入集群内训练,并采用基于低秩模型的相似性度量方法实现实时聚类,显著提升了性能并降低了计算开销。
English: LCFed is an efficient clustered federated learning framework that enhances performance by integrating global knowledge into intra-cluster training and using a low-rank model-based similarity measurement for real-time clustering with minimal computational overhead.

Authors:Zheng Lin, Yuxin Zhang, Zhe Chen, Zihan Fang, Cong Wu, Xianhao Chen, Yue Gao, Jun Luo
Title: LEO-Split: A Semi-Supervised Split Learning Framework over LEO Satellite Networks
Abstract:
Recently, the increasing deployment of LEO satellite systems has enabled various space analytics (e.g., crop and climate monitoring), which rely heavily on advancements in deep learning (DL). However, the intermittent connectivity between LEO satellites and the ground station (GS) significantly hinders the timely transmission of raw data to the GS for centralized learning, while scaled-up DL models hamper distributed learning on resource-constrained LEO satellites. Though split learning (SL) can be a potential solution to these problems by partitioning a model and offloading the primary training workload to the GS, the labor-intensive labeling process remains an obstacle, with intermittent connectivity and data heterogeneity being other challenges. In this paper, we propose LEO-Split, a semi-supervised (SS) SL design tailored for satellite networks to combat these challenges. Leveraging SS learning to handle (labeled) data scarcity, we construct an auxiliary model to tackle the training failure during the satellite-GS non-contact time. Moreover, we propose a pseudo-labeling algorithm to rectify data imbalances across satellites. Lastly, an adaptive activation interpolation scheme is devised to prevent the overfitting of server-side sub-model training at the GS. Extensive experiments with real-world LEO satellite traces (e.g., Starlink) demonstrate that our LEO-Split framework achieves superior performance compared to state-of-the-art benchmarks.
中文:LEO-Split框架通过结合半监督分割学习与辅助模型及自适应技术,有效解决了卫星网络中的间歇性连接和数据异构性挑战,在真实场景实验中展现出卓越性能。
English: The LEO-Split framework addresses challenges in satellite networks by combining semi-supervised split learning with an auxiliary model and adaptive techniques to handle intermittent connectivity and data heterogeneity, achieving superior performance in real-world experiments.

Authors:Ehsan Latif, Ying Chen, Xiaoming Zhai, Yue Yin
Title: Human-Centered Design for AI-based Automatically Generated Assessment Reports: A Systematic Review
Abstract:
This paper provides a comprehensive review of the design and implementation of automatically generated assessment reports (AutoRs) for formative use in K-12 Science, Technology, Engineering, and Mathematics (STEM) classrooms. With the increasing adoption of technology-enhanced assessments, there is a critical need for human-computer interactive tools that efficiently support the interpretation and application of assessment data by teachers. AutoRs are designed to provide synthesized, interpretable, and actionable insights into students' performance, learning progress, and areas for improvement. Guided by cognitive load theory, this study emphasizes the importance of reducing teachers' cognitive demands through user-centered and intuitive designs. It highlights the potential of diverse information presentation formats such as text, visual aids, and plots and advanced functionalities such as live and interactive features to enhance usability. However, the findings also reveal that many existing AutoRs fail to fully utilize these approaches, leading to high initial cognitive demands and limited engagement. This paper proposes a conceptual framework to inform the design, implementation, and evaluation of AutoRs, balancing the trade-offs between usability and functionality. The framework aims to address challenges in engaging teachers with technology-enhanced assessment results, facilitating data-driven decision-making, and providing personalized feedback to improve the teaching and learning process.
中文: 本文综述了K-12 STEM教育中自动生成评估报告的设计,提出了一个概念框架,旨在通过直观交互功能降低教师认知负荷,提升报告可用性与功能性,促进数据驱动的教学决策。
English: This paper reviews the design of automatically generated assessment reports for K-12 STEM education, proposing a conceptual framework to enhance their usability and functionality while reducing teachers' cognitive load through intuitive, interactive features.

Authors:Pedro Miguel Sánchez Sánchez, Enrique Tomás Martínez Beltrán, Chao Feng, Gérôme Bovet, Gregorio Martínez Pérez, Alberto Huertas Celdrán
Title: S-VOTE: Similarity-based Voting for Client Selection in Decentralized Federated Learning
Abstract:
Decentralized Federated Learning (DFL) enables collaborative, privacy-preserving model training without relying on a central server. This decentralized approach reduces bottlenecks and eliminates single points of failure, enhancing scalability and resilience. However, DFL also introduces challenges such as suboptimal models with non-IID data distributions, increased communication overhead, and resource usage. Thus, this work proposes S-VOTE, a voting-based client selection mechanism that optimizes resource usage and enhances model performance in federations with non-IID data conditions. S-VOTE considers an adaptive strategy for spontaneous local training that addresses participation imbalance, allowing underutilized clients to contribute without significantly increasing resource costs. Extensive experiments on benchmark datasets demonstrate the effectiveness of S-VOTE. In more detail, it reduces communication costs by up to 21%, converges 4-6% faster, and improves local performance by 9-17% compared to baseline methods in some configurations, all while reducing energy consumption by 14-24%. These results highlight the potential of S-VOTE to address DFL challenges in heterogeneous environments.
中文: S-VOTE是一种基于投票的客户端选择机制,用于去中心化联邦学习,在非独立同分布数据条件下优化资源使用并提升模型性能,实现了更快的收敛速度、更低的通信成本和能耗减少。
English: S-VOTE is a voting-based client selection mechanism for Decentralized Federated Learning that optimizes resource usage and enhances model performance in non-IID data conditions, achieving faster convergence, lower communication costs, and reduced energy consumption.

Authors:Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Sam Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini, Benjamin Van Durme, Dawn Lawrie
Title: mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval
Abstract:
Retrieval systems generally focus on web-style queries that are short and underspecified. However, advances in language models have facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. These efforts have focused exclusively on English; therefore, we do not yet understand how they work across languages. We introduce mFollowIR, a multilingual benchmark for measuring instruction-following ability in retrieval models. mFollowIR builds upon the TREC NeuCLIR narratives (or instructions) that span three diverse languages (Russian, Chinese, Persian), giving both query and instruction to the retrieval models. We make small changes to the narratives and isolate how well retrieval models can follow these nuanced changes. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance. We see strong cross-lingual performance with English-based retrievers that were trained using instructions, but find a notable drop in performance in the multilingual setting, indicating that more work is needed in developing data for instruction-based multilingual retrievers.
Chinese Summary: 随着mFollowIR多语言基准的推出,检索系统在处理复杂查询方面取得进展,该基准测试模型跨语言执行细微指令的能力,结果显示跨语言性能强劲,但凸显了改进多语言训练数据的必要性。
English Summary: Retrieval systems are advancing to handle complex queries with the introduction of mFollowIR, a multilingual benchmark that tests models' ability to follow nuanced instructions across languages, revealing strong cross-lingual performance but highlighting the need for improved multilingual training data.

Authors:Krzysztof Byrski, Marcin Mazur, Jacek Tabor, Tadeusz Dziarmaga, Marcin Kądziołka, Dawid Baran, Przemysław Spurek
Title: RaySplats: Ray Tracing based Gaussian Splatting
Abstract:
3D Gaussian Splatting (3DGS) is a process that enables the direct creation of 3D objects from 2D images. This representation offers numerous advantages, including rapid training and rendering. However, a significant limitation of 3DGS is the challenge of incorporating light and shadow reflections, primarily due to the utilization of rasterization rather than ray tracing for rendering. This paper introduces RaySplats, a model that employs ray-tracing based Gaussian Splatting. Rather than utilizing the projection of Gaussians, our method employs a ray-tracing mechanism, operating directly on Gaussian primitives represented by confidence ellipses with RGB colors. In practice, we compute the intersection between ellipses and rays to construct ray-tracing algorithms, facilitating the incorporation of meshes with Gaussian Splatting models and the addition of lights, shadows, and other related effects.
中文: 3D高斯泼溅(3DGS)虽能高效从二维图像生成三维对象,但因采用光栅化渲染而难以处理光影反射;为此提出的RaySplats模型通过光线追踪与高斯图元结合,实现了对光照和阴影效果的优化。
English: 3D Gaussian Splatting (3DGS) enables efficient 3D reconstruction from 2D images but struggles with lighting effects due to its rasterization approach, leading to the development of RaySplats, which integrates ray tracing with Gaussian primitives for enhanced light and shadow rendering.

Authors:Haozhe Jia, Wenshuo Chen, Zhihui Huang, Lei Wang, Hongru Xiao, Nanqian Jia, Keming Wu, Songning Lai, Bowen Tian, Yutao Yue
Title: Physics-Informed Representation Alignment for Sparse Radio-Map Reconstruction
Abstract:
Radio map reconstruction is essential for enabling advanced applications, yet challenges such as complex signal propagation and sparse observational data hinder accurate reconstruction in practical scenarios. Existing methods often fail to align physical constraints with data-driven features, particularly under sparse measurement conditions. To address these issues, we propose Physics-Aligned Radio Map Diffusion Model (PhyRMDM), a novel framework that establishes cross-domain representation alignment between physical principles and neural network features through dual learning pathways. The proposed model integrates Physics-Informed Neural Networks (PINNs) with a representation alignment mechanism that explicitly enforces consistency between Helmholtz equation constraints and environmental propagation patterns. Experimental results demonstrate significant improvements over state-of-the-art methods, achieving NMSE of 0.0031 under Static Radio Map (SRM) conditions, and NMSE of 0.0047 with Dynamic Radio Map (DRM) scenarios. The proposed representation alignment paradigm provides 37.2% accuracy enhancement in ultra-sparse cases (1% sampling rate), confirming its effectiveness in bridging physics-based modeling and deep learning for radio map reconstruction.
中文摘要:提出的物理对齐无线电地图扩散模型通过双学习路径将物理约束与神经网络特征相结合,在稀疏测量条件下显著提升了静态和动态无线电地图重建的准确性。
English Summary: The proposed Physics-Aligned Radio Map Diffusion Model (PhyRMDM) integrates physical constraints with neural network features through dual learning pathways, achieving significant accuracy improvements in both static and dynamic radio map reconstruction under sparse measurement conditions.

Authors:Nhan Phan, Thu Nguyen, Pål Halvorsen, Michael A. Riegler
Title: Principal Components for Neural Network Initialization
Abstract:
Principal Component Analysis (PCA) is a commonly used tool for dimension reduction and denoising. Therefore, it is also widely used on the data prior to training a neural network. However, this approach can complicate the explanations that explainable AI (XAI) methods provide for the model's decisions. In this work, we analyze the potential issues with this approach and propose Principal Components-based Initialization (PCsInit), a strategy that incorporates PCA into a neural network by initializing its first layer with the principal components, together with two variants, PCsInit-Act and PCsInit-Sub. Explanations using these strategies are as direct and straightforward as for neural networks and are simpler than those obtained when PCA is applied before training a neural network on the principal components. Moreover, as will be illustrated in the experiments, such training strategies can also allow further improvement of training via backpropagation.
中文: PCA常用于神经网络训练前的数据预处理,但会使可解释AI的分析复杂化,因此本研究提出PCsInit方法,通过将主成分融入网络初始化层来保持解释的直观性并提升训练效果。
English: PCA is widely used for preprocessing data before neural network training but complicates explainable AI interpretations, so this study introduces PCsInit, a method integrating PCA into network initialization to maintain straightforward explanations and enhance training efficiency.
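
The core idea in the abstract above, initializing a network's first layer with principal components, can be sketched in a few lines. This is a minimal illustration assuming a dense first layer of the form `x @ W.T + b`; the function name and the choice to fold mean-centering into the bias are assumptions of this sketch, and the PCsInit-Act and PCsInit-Sub variants are not covered.

```python
import numpy as np


def pca_first_layer(X, k):
    """Build first-layer weights and bias from the top-k principal components.

    X has shape (n_samples, n_features). The returned W has shape
    (k, n_features) and b has shape (k,), so that X @ W.T + b equals the
    PCA projection of the mean-centered data at initialization.
    """
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    W = vt[:k].copy()
    # Folding the centering into the bias is an assumption of this sketch.
    b = -W @ mean
    return W, b
```

Because the layer stays an ordinary trainable layer on the raw inputs, backpropagation can update `W` and `b` after initialization, which matches the abstract's point that training can improve on the PCA starting point.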

Authors:Xiangyu Sun, Xiaoguang Zou, Yuanquan Wu, Guotai Wang, Shaoting Zhang
Title: Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification
Abstract:
X-ray imaging is pivotal in medical diagnostics, offering non-invasive insights into a range of health conditions. Recently, vision-language models, such as the Contrastive Language-Image Pretraining (CLIP) model, have demonstrated potential in improving diagnostic accuracy by leveraging large-scale image-text datasets. However, since CLIP was not initially designed for medical images, several CLIP-like models trained specifically on medical images have been developed. Despite their enhanced performance, issues of fairness - particularly regarding demographic attributes - remain largely unaddressed. In this study, we perform a comprehensive fairness analysis of CLIP-like models applied to X-ray image classification. We assess their performance and fairness across diverse patient demographics and disease categories using zero-shot inference and various fine-tuning techniques, including Linear Probing, Multilayer Perceptron (MLP), Low-Rank Adaptation (LoRA), and full fine-tuning. Our results indicate that while fine-tuning improves model accuracy, fairness concerns persist, highlighting the need for further fairness interventions in these foundational models.
中文摘要:本研究对CLIP类模型在X射线图像分类中的公平性进行全面分析,结果表明尽管微调能提高诊断准确性,但未能解决涉及患者人口统计特征的公平性问题,凸显了实施针对性公平干预的必要性。
English Summary: This study conducts a comprehensive fairness analysis of CLIP-like models for X-ray image classification, revealing that while fine-tuning enhances diagnostic accuracy, it fails to resolve fairness issues related to patient demographics, underscoring the need for targeted fairness interventions.

Authors:Xinshuai Dong, Ignavier Ng, Boyang Sun, Haoyue Dai, Guang-Yuan Hao, Shunxing Fan, Peter Spirtes, Yumou Qiu, Kun Zhang
Title: Permutation-Based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data
Abstract:
Recent advances have shown that statistical tests for the rank of cross-covariance matrices play an important role in causal discovery. These rank tests include partial correlation tests as special cases and provide further graphical information about latent variables. Existing rank tests typically assume that all the continuous variables can be perfectly measured, and yet, in practice many variables can only be measured after discretization. For example, in psychometric studies, the continuous level of certain personality dimensions of a person can only be measured after being discretized into order-preserving options such as disagree, neutral, and agree. Motivated by this, we propose Mixed data Permutation-based Rank Test (MPRT), which properly controls the statistical errors even when some or all variables are discretized. Theoretically, we establish the exchangeability and estimate the asymptotic null distribution by permutations; as a consequence, MPRT can effectively control the Type I error in the presence of discretization while previous methods cannot. Empirically, our method is validated by extensive experiments on synthetic data and real-world data to demonstrate its effectiveness as well as applicability in causal discovery.
Chinese: 提出的混合数据置换秩检验(MPRT)在变量离散化情况下有效控制交叉协方差矩阵秩检验的统计误差,通过理论保证和实证验证解决了现有方法的局限性。
English: The proposed Mixed data Permutation-based Rank Test (MPRT) effectively controls statistical errors for cross-covariance matrix rank testing when variables are discretized, addressing limitations of existing methods through theoretical guarantees and empirical validation.

Authors:Matthew Neeley, Guantong Qi, Guanchu Wang, Ruixiang Tang, Dongxue Mao, Chaozhong Liu, Sasidhar Pasupuleti, Bo Yuan, Fan Xia, Pengfei Liu, Zhandong Liu, Xia Hu
Title: Survey and Improvement Strategies for Gene Prioritization with Large Language Models
Abstract:
Rare diseases are challenging to diagnose due to limited patient data and genetic diversity. Despite advances in variant prioritization, many cases remain undiagnosed. While large language models (LLMs) have performed well in medical exams, their effectiveness in diagnosing rare genetic diseases has not been assessed. To identify causal genes, we benchmarked various LLMs for gene prioritization. Using multi-agent and Human Phenotype Ontology (HPO) classification, we categorized patients based on phenotypes and solvability levels. As gene set size increased, LLM performance deteriorated, so we used a divide-and-conquer strategy to break the task into smaller subsets. At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly. The multi-agent and HPO approaches helped distinguish confidently solved cases from challenging ones, highlighting the importance of known gene-phenotype associations and phenotype specificity. We found that cases with specific phenotypes or clear associations were more accurately solved. However, we observed biases toward well-studied genes and input order sensitivity, which hindered gene prioritization. Our divide-and-conquer strategy improved accuracy by overcoming these biases. By utilizing HPO classification, novel multi-agent techniques, and our LLM strategy, we improved causal gene identification accuracy compared to our baseline evaluation. This approach streamlines rare disease diagnosis, facilitates reanalysis of unsolved cases, and accelerates gene discovery, supporting the development of targeted diagnostics and therapies.
中文摘要:本研究评估大语言模型在罕见病基因优先排序中的应用,发现GPT-4模型准确率达30%,通过多智能体与人类表型本体分类的分治策略有效克服模型偏差,显著提升致病基因识别效率。
English Summary: This study evaluates large language models for rare disease gene prioritization, finding that GPT-4 achieved nearly 30% accuracy at baseline and that a divide-and-conquer strategy, combined with multi-agent and HPO classification, improved causal gene identification by overcoming model biases.

Authors:Wenshuo Chen, Haozhe Jia, Songning Lai, Keming Wu, Hongru Xiao, Lijie Hu, Yutao Yue
Title: Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss
Abstract:
Rapid progress in text-to-motion generation has been largely driven by diffusion models. However, existing methods focus solely on temporal modeling, thereby overlooking frequency-domain analysis. We identify two key phases in motion denoising: the semantic planning stage and the fine-grained improving stage. To address these phases effectively, we propose Frequency enhanced text-to-motion diffusion model (Free-T2M), incorporating stage-specific consistency losses that enhance the robustness of static features and improve fine-grained accuracy. Extensive experiments demonstrate the effectiveness of our method. Specifically, on StableMoFusion, our method reduces the FID from 0.189 to 0.051, establishing a new SOTA performance within the diffusion architecture. These findings highlight the importance of incorporating frequency-domain insights into text-to-motion generation for more precise and robust results.
Chinese: 提出的Free-T2M模型通过引入频域分析和阶段一致性损失,在文本生成动作任务中实现了最先进性能,显著提升了语义规划和精细动作的生成质量。
English: The proposed Free-T2M model introduces frequency-domain analysis and stage-specific consistency losses to enhance text-to-motion generation, achieving state-of-the-art performance by improving both semantic planning and fine-grained motion details.

Authors:Xinzhe Xia, Weiguang Zhao, Yuyao Yan, Guanyu Yang, Rui Zhang, Kaizhu Huang, Xi Yang
Title: Towards Training-Free Open-World Classification with 3D Generative Models
Abstract:
3D open-world classification is a challenging yet essential task in dynamic and unstructured real-world scenarios, requiring both open-category and open-pose recognition. To address these challenges, recent wisdom often takes sophisticated 2D pre-trained models to provide enriched and stable representations. However, these methods largely rely on how 3D objects can be projected into 2D space, which is unfortunately not well solved, and thus significantly limits their performance. Unlike these present efforts, in this paper we make a pioneering exploration of 3D generative models for 3D open-world classification. Drawing on abundant prior knowledge from 3D generative models, we additionally craft a rotation-invariant feature extractor. This innovative synergy endows our pipeline with the advantages of being training-free, open-category, and pose-invariant, thus well suited to 3D open-world classification. Extensive experiments on benchmark datasets demonstrate the potential of generative models in 3D open-world classification, achieving state-of-the-art performance on ModelNet10 and McGill with 32.0% and 8.7% overall accuracy improvement, respectively.
Chinese: 本文开创性地将3D生成模型与旋转不变特征提取器相结合,用于3D开放世界分类,无需训练即可实现开放类别和姿态不变性,并在基准数据集上取得了最先进的性能。
English: This paper pioneers the use of 3D generative models combined with a rotation-invariant feature extractor for 3D open-world classification, achieving state-of-the-art performance without requiring training while being open-category and pose-invariant.

Authors:Ashish Bastola, Hao Wang, Abolfazl Razi
Title: Anomaly Detection in Cooperative Vehicle Perception Systems under Imperfect Communication
Abstract:
Anomaly detection is a critical requirement for ensuring safety in autonomous driving. In this work, we leverage Cooperative Perception to share information across nearby vehicles, enabling more accurate identification and consensus of anomalous behaviors in complex traffic scenarios. To account for the real-world challenge of imperfect communication, we propose a cooperative-perception-based anomaly detection framework (CPAD), which is a robust architecture that remains effective under communication interruptions, thereby facilitating reliable performance even in low-bandwidth settings. Since no multi-agent anomaly detection dataset exists for vehicle trajectories, we introduce a benchmark dataset of 15,000 different scenarios with 90,000 trajectories, generated through rule-based vehicle dynamics analysis. Empirical results demonstrate that our approach outperforms standard anomaly classification methods in F1-score and AUC, and shows strong robustness to agent connection interruptions.
中文摘要:本研究提出了一种基于协同感知的异常检测框架(CPAD),通过车辆间信息共享提升自动驾驶中的异常行为识别能力,基于新建的轨迹基准数据集验证了该方法在通信受限情况下仍保持优越性能和强鲁棒性。
English Summary: This study introduces a cooperative-perception-based anomaly detection framework (CPAD) that enables vehicles to share information for improved anomaly identification in autonomous driving, demonstrating superior performance and robustness even under communication constraints through a newly created benchmark dataset.

Authors:Edoardo Ghignone, Nicolas Baumann, Cheng Hu, Jonathan Wang, Lei Xie, Andrea Carron, Michele Magno
Title: RLPP: A Residual Method for Zero-Shot Real-World Autonomous Racing on Scaled Platforms
Abstract:
Autonomous racing presents a complex environment requiring robust controllers capable of making rapid decisions under dynamic conditions. While traditional controllers based on tire models are reliable, they often demand extensive tuning or system identification. Reinforcement Learning (RL) methods offer significant potential due to their ability to learn directly from interaction, yet they typically suffer from the sim-to-real gap, where policies trained in simulation fail to perform effectively in the real world. In this paper, we propose RLPP, a residual RL framework that enhances a Pure Pursuit (PP) controller with an RL-based residual. This hybrid approach leverages the reliability and interpretability of PP while using RL to fine-tune the controller's performance in real-world scenarios. Extensive testing on the F1TENTH platform demonstrates that RLPP improves lap times of the baseline controllers by up to 6.37%, closing the gap to the state-of-the-art methods by more than 52% and providing reliable performance in zero-shot real-world deployment. It overcomes key challenges associated with sim-to-real transfer and reduces the performance gap from simulation to reality by more than 8-fold compared to the baseline RL controller. The RLPP framework is made available as an open-source tool, encouraging further exploration and advancement in autonomous racing research. The code is available at: www.github.com/forzaeth/rlpp.
Chinese: RLPP框架将可靠的纯追踪控制器与强化学习相结合,显著缩小了仿真与现实的差距,在真实测试中圈速提升最高达6.37%。
English: The RLPP framework combines a reliable Pure Pursuit controller with reinforcement learning to enhance autonomous racing performance, significantly reducing the sim-to-real gap and improving lap times by up to 6.37% in real-world tests.
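
The residual structure described in the abstract above, a Pure Pursuit command plus a learned correction, can be sketched as follows. The Pure Pursuit formula is the standard geometric one; applying the residual to steering only, the clipping bounds, and all function and parameter names are assumptions of this sketch rather than details of the RLPP implementation.

```python
import math


def pure_pursuit_steering(target_x, target_y, wheelbase):
    """Standard Pure Pursuit steering angle in radians.

    The lookahead point (target_x, target_y) is given in the vehicle frame,
    with x pointing forward and y pointing left.
    """
    lookahead = math.hypot(target_x, target_y)
    if lookahead == 0.0:
        raise ValueError("lookahead point must differ from the vehicle position")
    # Curvature of the arc through the vehicle origin and the lookahead point.
    curvature = 2.0 * target_y / lookahead ** 2
    return math.atan(wheelbase * curvature)


def residual_steering(target_x, target_y, wheelbase, residual,
                      max_residual=0.1, max_steer=0.4):
    """Add a bounded learned residual to the Pure Pursuit command.

    residual would come from an RL policy; here it is a plain number.
    max_residual and max_steer are hypothetical limits in radians.
    """
    base = pure_pursuit_steering(target_x, target_y, wheelbase)
    clipped = max(-max_residual, min(max_residual, residual))
    return max(-max_steer, min(max_steer, base + clipped))
```

Bounding the residual keeps the combined command close to the Pure Pursuit baseline, which is one common way to retain the baseline's predictability while the policy adjusts it.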

Authors:Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Philipp Henzler, Konrad Schindler, Federico Tombari
Title: CubeDiff: Repurposing Diffusion-Based Image Models for Panorama Generation
Abstract:
We introduce a novel method for generating 360° panoramas from text prompts or images. Our approach leverages recent advances in 3D generation by employing multi-view diffusion models to jointly synthesize the six faces of a cubemap. Unlike previous methods that rely on processing equirectangular projections or autoregressive generation, our method treats each face as a standard perspective image, simplifying the generation process and enabling the use of existing multi-view diffusion models. We demonstrate that these models can be adapted to produce high-quality cubemaps without requiring correspondence-aware attention layers. Our model allows for fine-grained text control, generates high resolution panorama images and generalizes well beyond its training set, whilst achieving state-of-the-art results, both qualitatively and quantitatively. Project page: https://cubediff.github.io/
中文: 本文提出一种从文本或图像生成360度全景图的新方法,通过多视角扩散模型将立方体贴图面作为标准透视图像处理,无需复杂对应层即可实现高质量输出和精细控制,达到业界领先水平。
English: This paper presents a new method for creating 360° panoramas from text or images by using multi-view diffusion models to generate cubemap faces as standard perspective images, achieving state-of-the-art quality and fine-grained control without complex correspondence layers.

Authors:Chenru Jiang, Chengrui Zhang, Xi Yang, Jie Sun, Yifei Zhang, Bin Dong, Kaizhu Huang
Title: Consistency Diffusion Models for Single-Image 3D Reconstruction with Priors
Abstract:
This paper delves into the study of 3D point cloud reconstruction from a single image. Our objective is to develop the Consistency Diffusion Model, exploring synergistic 2D and 3D priors in the Bayesian framework to ensure superior consistency in the reconstruction process, a challenging yet critical requirement in this field. Specifically, we introduce a pioneering training framework under diffusion models that brings two key innovations. First, we convert 3D structural priors derived from the initial 3D point cloud as a bound term to increase evidence in the variational Bayesian framework, leveraging these robust intrinsic priors to tightly govern the diffusion training process and bolster consistency in reconstruction. Second, we extract and incorporate 2D priors from the single input image, projecting them onto the 3D point cloud to enrich the guidance for diffusion training. Our framework not only sidesteps potential model learning shifts that may arise from directly imposing additional constraints during training but also precisely transposes the 2D priors into the 3D domain. Extensive experimental evaluations reveal that our approach sets new benchmarks in both synthetic and real-world datasets. The code is included with the submission.
中文: 本文提出了Consistency Diffusion模型,通过贝叶斯框架协同利用2D和3D先验,在单图像3D点云重建中实现了卓越的一致性,并在实验中创下了新基准。
English: This paper introduces the Consistency Diffusion Model, which integrates 2D and 3D priors within a Bayesian framework to achieve superior consistency in single-image 3D point cloud reconstruction, setting new benchmarks in experimental evaluations.

Authors:Piyush Gupta, David Isele, Enna Sachdeva, Pin-Hao Huang, Behzad Dariush, Kwonjoon Lee, Sangjae Bae
Title: Generalized Mission Planning for Heterogeneous Multi-Robot Teams via LLM-constructed Hierarchical Trees
Abstract:
We present a novel mission-planning strategy for heterogeneous multi-robot teams, taking into account the specific constraints and capabilities of each robot. Our approach employs hierarchical trees to systematically break down complex missions into manageable sub-tasks. We develop specialized APIs and tools, which are utilized by Large Language Models (LLMs) to efficiently construct these hierarchical trees. Once the hierarchical tree is generated, it is further decomposed to create optimized schedules for each robot, ensuring adherence to their individual constraints and capabilities. We demonstrate the effectiveness of our framework through detailed examples covering a wide range of missions, showcasing its flexibility and scalability.
中文: 本研究提出了一种异构多机器人团队的新型任务规划策略,利用分层树将复杂任务分解为子任务,通过专用API和工具由大语言模型生成这些树结构,进而分解为适应各机器人约束和能力的优化调度方案,并在多种任务中验证了其灵活性和可扩展性。
English: This study introduces a novel mission-planning strategy for heterogeneous multi-robot teams that uses hierarchical trees to break down complex missions into sub-tasks, with LLMs generating these trees through specialized APIs and tools, then decomposing them into optimized schedules tailored to each robot's constraints and capabilities, demonstrating flexibility and scalability across various missions.

Authors:Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
Title: Smoothed Embeddings for Robust Language Models
Abstract:
Improving the safety and reliability of large language models (LLMs) is a crucial aspect of realizing trustworthy AI systems. Although alignment methods aim to suppress harmful content generation, LLMs are often still vulnerable to jailbreaking attacks that employ adversarial inputs that subvert alignment and induce harmful outputs. We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token, with the aim of better preserving semantic information. Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses.
中文: 我们提出的RESTA防御通过向嵌入向量添加随机噪声并在生成过程中聚合标记,有效提升了语言模型抗越狱攻击的鲁棒性,同时保持了更好的实用性平衡。
English: The proposed RESTA defense enhances LLM safety by adding random noise to embeddings and aggregating tokens during generation, achieving improved robustness-utility tradeoffs against jailbreaking attacks.
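
Illustration: the general idea of embedding smoothing with per-token aggregation can be sketched as follows. This is a toy sketch under stated assumptions, not the RESTA implementation. The abstract only says that noise is added to embeddings and that aggregation is performed per output token, so the Gaussian noise model, the majority-vote aggregation, and the names `smoothed_next_token` and `toy_logits` are assumptions introduced here.

```python
import numpy as np


def smoothed_next_token(embeddings, logits_fn, sigma=0.1, num_samples=8, seed=0):
    """Pick the next token by majority vote over noisy copies of the input.

    embeddings: array of shape (sequence_length, embedding_dim)
    logits_fn:  callable mapping such an array to a vector of token logits
    """
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(num_samples):
        # Perturb every embedding vector with isotropic Gaussian noise.
        noisy = embeddings + sigma * rng.standard_normal(embeddings.shape)
        votes.append(int(np.argmax(logits_fn(noisy))))
    # Aggregate the per-sample predictions by majority vote.
    tokens, counts = np.unique(votes, return_counts=True)
    return int(tokens[np.argmax(counts)])


def toy_logits(embeddings):
    """Stand-in for a language model: 3 tokens scored from the mean embedding."""
    weights = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    return weights @ embeddings.mean(axis=0)
```

With `sigma=0.0` the procedure reduces to ordinary greedy decoding of a single token, which makes the noise level the knob that trades robustness against utility.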

Authors:Weihua Zheng, Xin Huang, Zhengyuan Liu, Tarun Kumar Vangani, Bowei Zou, Xiyan Tao, Yuhao Wu, Ai Ti Aw, Nancy F. Chen, Roy Ka-Wei Lee
Title: AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought
Abstract:
Large language models (LLMs) have shown impressive multilingual capabilities through pretraining on diverse corpora. Although these models show strong reasoning abilities, their performance varies significantly between languages due to the imbalanced distribution of training data. Existing approaches using sample-level translation for extensive multilingual pretraining and cross-lingual tuning face scalability challenges and often fail to capture nuanced reasoning processes across languages. In this paper, we introduce AdaMCOT (Adaptive Multilingual Chain-of-Thought), a framework that enhances multilingual factual reasoning by dynamically routing thought processes in intermediary "thinking languages" before generating target-language responses. AdaMCOT leverages a language-agnostic core and incorporates an adaptive, reward-based mechanism for selecting optimal reasoning pathways without requiring additional pretraining. Our comprehensive evaluation across multiple benchmarks demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency, with particularly strong performance gains in low-resource language settings. An in-depth analysis of the model's hidden states and semantic space further elucidates the underlying mechanism of our method. The results suggest that adaptive reasoning paths can effectively bridge the performance gap between high and low-resource languages while maintaining cultural and linguistic nuances.
中文: AdaMCOT是一种自适应框架,通过中间语言动态引导思维过程来提升多语言事实推理能力,无需额外预训练即可显著改善性能与跨语言一致性。
English: AdaMCOT is an adaptive framework that improves multilingual factual reasoning by dynamically routing thought processes through intermediary languages, significantly enhancing performance and cross-lingual consistency without requiring additional pretraining.

Authors:Chu Zhao, Enneng Yang, Yuliang Liang, Jianzhe Zhao, Guibing Guo, Xingwei Wang
Title: Distributionally Robust Graph Out-of-Distribution Recommendation via Diffusion Model
Abstract:
The distributionally robust optimization (DRO)-based graph neural network methods improve recommendation systems' out-of-distribution (OOD) generalization by optimizing the model's worst-case performance. However, these studies fail to consider the impact of noisy samples in the training data, which results in diminished generalization capabilities and lower accuracy. Through experimental and theoretical analysis, this paper reveals that current DRO-based graph recommendation methods assign greater weight to noise distribution, leading to model parameter learning being dominated by it. When the model overly focuses on fitting noise samples in the training data, it may learn irrelevant or meaningless features that cannot be generalized to OOD data. To address this challenge, we design a Distributionally Robust Graph model for OOD recommendation (DRGO). Specifically, our method first employs a simple and effective diffusion paradigm to alleviate the noisy effect in the latent space. Additionally, an entropy regularization term is introduced in the DRO objective function to avoid extreme sample weights in the worst-case distribution. Finally, we provide a theoretical proof of the generalization error bound of DRGO as well as a theoretical analysis of how our approach mitigates noisy sample effects, which helps to better understand the proposed framework from a theoretical perspective. We conduct extensive experiments on four datasets to evaluate the effectiveness of our framework against three typical distribution shifts, and the results demonstrate its superiority in both independently and identically distributed distributions (IID) and OOD.
中文: 本文提出DRGO分布鲁棒图模型,通过扩散范式和熵正则化缓解噪声样本影响,在独立同分布和分布外推荐场景中均展现出优越性能。
English: This paper introduces DRGO, a distributionally robust graph model that enhances OOD recommendation by mitigating noisy sample effects through a diffusion paradigm and entropy regularization, achieving superior performance in both IID and OOD scenarios.
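
Illustration: the effect of an entropy term on worst-case sample weights can be shown with a generic textbook form of the problem. Maximizing sum_i w_i * loss_i + tau * H(w) over the probability simplex, where H is the Shannon entropy, has the closed-form solution w = softmax(loss / tau). This sketch assumes that generic objective. The exact DRGO objective is not given in the abstract and may differ, and the name `entropy_regularized_weights` and the temperature `tau` are chosen here for illustration.

```python
import numpy as np


def entropy_regularized_weights(losses, tau):
    """Worst-case sample weights for an entropy-regularized DRO objective.

    Solves max_w sum_i w_i * loss_i + tau * H(w) over the probability simplex,
    whose solution is the softmax of the losses at temperature tau.
    """
    z = np.asarray(losses, dtype=float) / tau
    z = z - z.max()  # shift for numerical stability; softmax is shift-invariant
    w = np.exp(z)
    return w / w.sum()
```

A small `tau` concentrates nearly all weight on the highest-loss sample, which is where a noisy sample could dominate training, while a large `tau` pushes the weights toward uniform.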

Authors:Zhibo Tian, Ruijie Quan, Fan Ma, Kun Zhan, Yi Yang
Title: BrainGuard: Privacy-Preserving Multisubject Image Reconstructions from Brain Activities
Abstract:
Reconstructing perceived images from human brain activity forms a crucial link between human and machine learning through Brain-Computer Interfaces. Early methods primarily focused on training separate models for each individual to account for individual variability in brain activity, overlooking valuable cross-subject commonalities. Recent advancements have explored multisubject methods, but these approaches face significant challenges, particularly in data privacy and effectively managing individual variability. To overcome these challenges, we introduce BrainGuard, a privacy-preserving collaborative training framework designed to enhance image reconstruction from multisubject fMRI data while safeguarding individual privacy. BrainGuard employs a collaborative global-local architecture where individual models are trained on each subject's local data and operate in conjunction with a shared global model that captures and leverages cross-subject patterns. This architecture eliminates the need to aggregate fMRI data across subjects, thereby ensuring privacy preservation. To tackle the complexity of fMRI data, BrainGuard integrates a hybrid synchronization strategy, enabling individual models to dynamically incorporate parameters from the global model. By establishing a secure and collaborative training environment, BrainGuard not only protects sensitive brain data but also improves image reconstruction accuracy. Extensive experiments demonstrate that BrainGuard sets a new benchmark in both high-level and low-level metrics, advancing the state-of-the-art in brain decoding through its innovative design.
中文: BrainGuard提出了一种保护隐私的协作框架,通过结合本地模型与共享的全局模型来增强多被试fMRI数据的图像重建,无需数据聚合即可提高准确性并保护个体隐私。
English: BrainGuard introduces a privacy-preserving collaborative framework that enhances image reconstruction from multi-subject fMRI data by combining local models with a shared global model, eliminating the need for data aggregation and improving accuracy.

Authors:Mohammad Ali Rezaei, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Title: Where Do You Go? Pedestrian Trajectory Prediction using Scene Features
Abstract:
Accurate prediction of pedestrian trajectories is crucial for enhancing the safety of autonomous vehicles and reducing traffic fatalities involving pedestrians. While numerous studies have focused on modeling interactions among pedestrians to forecast their movements, the influence of environmental factors and scene-object placements has been comparatively underexplored. In this paper, we present a novel trajectory prediction model that integrates both pedestrian interactions and environmental context to improve prediction accuracy. Our approach captures spatial and temporal interactions among pedestrians within a sparse graph framework. To account for pedestrian-scene interactions, we employ advanced image enhancement and semantic segmentation techniques to extract detailed scene features. These scene and interaction features are then fused through a cross-attention mechanism, enabling the model to prioritize relevant environmental factors that influence pedestrian movements. Finally, a temporal convolutional network processes the fused features to predict future pedestrian trajectories. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches, achieving ADE and FDE values of 0.252 and 0.372 meters, respectively, underscoring the importance of incorporating both social interactions and environmental context in pedestrian trajectory prediction.
中文: 本文提出了一种新颖的轨迹预测模型,通过交叉注意力机制融合行人交互与环境上下文,在预测精度上显著优于现有方法,证明了社会交互与环境因素对于行人运动预测的关键作用。
English: This paper introduces a novel trajectory prediction model that integrates pedestrian interactions and environmental context through a cross-attention mechanism, significantly outperforming existing methods and demonstrating the critical role of both social and environmental factors in accurate pedestrian movement forecasting.
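
Illustration: ADE (average displacement error) and FDE (final displacement error), the two metrics quoted in the abstract, are standard in trajectory prediction and can be computed for a single predicted trajectory as below. This is a minimal sketch. The function name `ade_fde` is chosen here, and benchmark protocols typically also average over pedestrians and sometimes over multiple sampled predictions, which this sketch omits.

```python
import numpy as np


def ade_fde(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory.

    Both inputs have shape (num_timesteps, 2), holding x/y positions in meters.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    # Euclidean distance between prediction and truth at every timestep.
    distances = np.linalg.norm(predicted - ground_truth, axis=-1)
    # ADE averages over the horizon; FDE takes only the last timestep.
    return float(distances.mean()), float(distances[-1])
```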

Authors:Pierfrancesco Siena, Pasquale Claudio Africa, Michele Girfoglio, Gianluigi Rozza
Title: A hybrid Reduced Order Model to enforce outflow pressure boundary conditions in computational haemodynamics
Abstract:
This paper deals with the development of a Reduced-Order Model (ROM) to investigate haemodynamics in cardiovascular applications. It employs the use of Proper Orthogonal Decomposition (POD) for the computation of the basis functions and the Galerkin projection for the computation of the reduced coefficients. The main novelty of this work lies in the extension of the lifting function method, which typically is adopted for treating nonhomogeneous inlet velocity boundary conditions, to the handling of nonhomogeneous outlet boundary conditions for the pressure, representing a very delicate point in the numerical simulations of the cardiovascular system. Moreover, we incorporate a properly trained neural network in the ROM framework to approximate the mapping from the time parameter to the outflow pressure, which in the most general case is not available in closed form. We define our approach as "hybrid", because it merges physics-based elements with data-driven ones. At full order level, a Finite Volume method is employed for the discretization of the unsteady Navier-Stokes equations while a two-element Windkessel model is adopted to enforce a valuable estimation of the outflow pressure. Numerical results, firstly related to a 2D idealized blood vessel and then to a 3D patient-specific aortic arch, demonstrate that our ROM is able to accurately approximate the FOM with a significant reduction in the computational cost.
中文: 本文提出了一种混合降阶模型,通过扩展边界条件处理方法并集成神经网络进行压力映射,结合物理基础与数据驱动方法,实现了心血管血流动力学的高效模拟计算。
English: This paper introduces a hybrid reduced-order model combining physics-based and data-driven methods to efficiently simulate cardiovascular hemodynamics by extending boundary condition handling and integrating neural networks for pressure mapping.
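
Illustration: the two-element Windkessel model mentioned in the abstract relates outflow pressure P to flow rate Q through a resistance R and a compliance C via the ODE C * dP/dt + P / R = Q(t). The sketch below integrates that ODE with explicit Euler steps. It is a standalone toy and not the paper's coupling to the Finite Volume solver or to the neural network. The function name, argument names, and the choice of explicit Euler are assumptions made for illustration, and units are left to the caller.

```python
def windkessel_pressure(flow, dt, resistance, compliance, p0=0.0):
    """Integrate C * dP/dt = Q(t) - P / R with explicit Euler.

    flow: sequence of flow-rate samples Q, one per time step of length dt.
    Returns the pressure history, including the initial value p0.
    """
    pressure = p0
    history = [pressure]
    for q in flow:
        # Net inflow charges the compliance; P / R is the flow leaking
        # through the peripheral resistance.
        pressure = pressure + dt * (q - pressure / resistance) / compliance
        history.append(pressure)
    return history
```

For a constant flow the pressure relaxes toward R * Q with time constant R * C, which gives a quick sanity check on parameter values.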

Authors:Zhenghao Lin, Zihao Tang, Xiao Liu, Yeyun Gong, Yi Cheng, Qi Chen, Hang Li, Ying Xin, Ziyue Yang, Kailai Yang, Yu Yan, Xiao Liang, Shuai Lu, Yiming Huang, Zheheng Luo, Lei Qu, Xuan Feng, Yaoxiang Wang, Yuqing Xia, Feiyang Chen, Yuting Jiang, Yasen Hu, Hao Ni, Binyang Li, Guoshuai Zhao, Jui-Hao Chiang, Zhongxin Guo, Chen Lin, Kun Kuang, Wenjie Li, Yelong Shen, Jian Jiao, Peng Cheng, Mao Yang
Title: Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models
Abstract:
We introduce Sigma, an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention, and pre-trained on our meticulously collected system domain data. DiffQKV attention significantly enhances the inference efficiency of Sigma by optimizing the Query (Q), Key (K), and Value (V) components in the attention mechanism differentially, based on their varying impacts on the model performance and efficiency indicators. Specifically, we (1) conduct extensive experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV, and (2) propose augmented Q to expand the Q head dimension, which enhances the model's representation capacity with minimal impacts on the inference speed. Rigorous theoretical and empirical analyses reveal that DiffQKV attention significantly enhances efficiency, achieving up to a 33.36% improvement in inference speed over the conventional grouped-query attention (GQA) in long-context scenarios. We pre-train Sigma on 6T tokens from various sources, including 19.5B system domain data that we carefully collect and 1T tokens of synthesized and rewritten data. In general domains, Sigma achieves comparable performance to other state-of-the-art models. In the system domain, we introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement up to 52.5%.
中文: Sigma是一款专用于系统领域的高效大语言模型,采用创新的DiffQKV注意力机制,通过差异化优化Q、K、V组件,在长上下文场景中推理速度比传统GQA提升高达33.36%,并在首个系统领域综合基准AIMicius上所有任务表现卓越,相对GPT-4绝对性能提升最高达52.5%。
English: Sigma is an efficient large language model for the system domain, utilizing the innovative DiffQKV attention mechanism that differentially optimizes Q, K, and V components to boost inference speed by up to 33.36% over GQA while achieving superior performance in system-specific tasks, even surpassing GPT-4 by up to 52.5% on the AIMicius benchmark.

Authors:Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
Title: ACEBench: Who Wins the Match Point in Tool Usage?
Abstract:
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs' tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. "Normal" evaluates tool usage in basic scenarios; "Special" evaluates tool usage in situations with ambiguous or incomplete instructions; "Agent" evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
Chinese: 大语言模型在结合工具进行决策时潜力显著,但现有评估基准存在场景局限、维度不足及执行开销大的问题,为此我们提出ACEBench基准,通过常规、特殊和智能体三类数据全面评估工具使用能力。
English: Large Language Models show promise in decision-making when using tools, but current benchmarks are limited by narrow scenarios, insufficient evaluation details, and high overhead, prompting the introduction of ACEBench to comprehensively assess tool usage across normal, special, and agent-based contexts.

Authors:Qiuxia Wu, Haiyang Huang, Kunming Su, Zhiyong Wang, Kun Hu
Title: DC-PCN: Point Cloud Completion Network with Dual-Codebook Guided Quantization
Abstract:
Point cloud completion aims to reconstruct complete 3D shapes from partial 3D point clouds. With advancements in deep learning techniques, various methods for point cloud completion have been developed. Despite achieving encouraging results, a significant issue remains: these methods often overlook the variability in point clouds sampled from a single 3D object surface. This variability can lead to ambiguity and hinder the achievement of more precise completion results. Therefore, in this study, we introduce a novel point cloud completion network, namely Dual-Codebook Point Completion Network (DC-PCN), following an encoder-decoder pipeline. The primary objective of DC-PCN is to formulate a singular representation of sampled point clouds originating from the same 3D surface. DC-PCN introduces a dual-codebook design to quantize point-cloud representations from a multilevel perspective. It consists of an encoder-codebook and a decoder-codebook, designed to capture distinct point cloud patterns at shallow and deep levels. Additionally, to enhance the information flow between these two codebooks, we devise an information exchange mechanism. This approach ensures that crucial features and patterns from both shallow and deep levels are effectively utilized for completion. Extensive experiments on the PCN, ShapeNet\_Part, and ShapeNet34 datasets demonstrate the state-of-the-art performance of our method.
中文: 本研究提出DC-PCN双码本点云补全网络,通过信息交换机制从多层级捕捉点云模式以解决采样点云的变异性,在基准数据集上实现了最先进的性能。
English: This study introduces DC-PCN, a dual-codebook point cloud completion network that addresses variability in sampled point clouds by capturing multilevel patterns through an information exchange mechanism, achieving state-of-the-art performance on benchmark datasets.

Authors:Alexandru Dimofte, Glenn Anta Bucagu, Thorir Mar Ingolfsson, Xiaying Wang, Andrea Cossettini, Luca Benini, Yawei Li
Title: CEReBrO: Compact Encoder for Representations of Brain Oscillations Using Efficient Alternating Attention
Abstract:
Electroencephalography (EEG) is a crucial tool for studying brain activity. Recently, self-supervised learning methods leveraging large unlabeled datasets have emerged as a potential solution to the scarcity of widely available annotated EEG data. However, current methods suffer from at least one of the following limitations: i) sub-optimal EEG signal modeling, ii) model sizes in the hundreds of millions of trainable parameters, and iii) reliance on private datasets and/or inconsistent public benchmarks, hindering reproducibility. To address these challenges, we introduce a Compact Encoder for Representations of Brain Oscillations using alternating attention (CEReBrO), a new small EEG foundation model. Our tokenization scheme represents EEG signals at a per-channel patch granularity. We propose an alternating attention mechanism that jointly models intra-channel temporal dynamics and inter-channel spatial correlations, achieving 2x speed improvement with 6x less memory required compared to standard self-attention. We present several model sizes ranging from 3.6 million to 85 million parameters. Pre-trained on over 20,000 hours of publicly available scalp EEG recordings with diverse channel configurations, our models set new benchmarks in emotion detection and seizure detection tasks, with competitive performance in anomaly classification and gait prediction. This validates our models' effectiveness and efficiency.
Chinese: 自监督学习虽解决脑电图数据稀缺问题,但现有方法在信号建模、模型规模和可复现性方面存在不足;新型紧凑型CEReBrO模型通过高效交替注意力机制克服了这些限制,并在多项任务中取得领先性能。
English: Self-supervised learning addresses EEG data scarcity, but current methods face limitations in signal modeling, model size, and reproducibility, which the new compact CEReBrO model overcomes with efficient alternating attention and achieves top performance in multiple tasks.

Authors:Tuan L. Vo, Quan Huu Do, Uyen Dang, Thu Nguyen, Pål Halvorsen, Michael A. Riegler, Binh T. Nguyen
Title: DPERC: Direct Parameter Estimation for Mixed Data
Abstract:
The covariance matrix is a foundation in numerous statistical and machine-learning applications such as Principal Component Analysis, Correlation Heatmap, etc. However, missing values within datasets present a formidable obstacle to accurately estimating this matrix. While imputation methods offer one avenue for addressing this challenge, they often entail a trade-off between computational efficiency and estimation accuracy. Consequently, attention has shifted towards direct parameter estimation, given its precision and reduced computational burden. In this paper, we propose Direct Parameter Estimation for Randomly Missing Data with Categorical Features (DPERC), an efficient approach for direct parameter estimation tailored to mixed data that contains missing values within continuous features. Our method is motivated by leveraging information from categorical features, which can significantly enhance covariance matrix estimation for continuous features. Our approach effectively harnesses the information embedded within mixed data structures. Through comprehensive evaluations of diverse datasets, we demonstrate the competitive performance of DPERC compared to various contemporary techniques. In addition, we also show by experiments that DPERC is a valuable tool for visualizing the correlation heatmap.
中文摘要:本文提出的DPERC方法通过利用分类特征信息,能有效处理连续特征缺失的混合数据直接参数估计问题,经多种数据集验证具有优越性能,并可作为相关热图可视化的实用工具。
English Summary: The proposed DPERC method enables efficient direct parameter estimation for mixed datasets with missing continuous features by leveraging categorical feature information, demonstrating competitive performance and utility for correlation visualization across diverse evaluations.

Authors:Paul Röttger, Giuseppe Attanasio, Felix Friedrich, Janis Goldzycher, Alicia Parrish, Rishabh Bhardwaj, Chiara Di Bonaventura, Roman Eng, Gaia El Khoury Geagea, Sujata Goswami, Jieun Han, Dirk Hovy, Seogyeong Jeong, Paloma Jeretič, Flor Miriam Plaza-del-Arco, Donya Rooein, Patrick Schramowski, Anastassia Shaitarova, Xudong Shen, Richard Willats, Andrea Zugarini, Bertie Vidgen
Title: MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Abstract:
Vision-language models (VLMs), which process image and text inputs, are increasingly integrated into chat assistants and other consumer AI applications. Without proper safeguards, however, VLMs may give harmful advice (e.g. how to self-harm) or encourage unsafe behaviours (e.g. to consume drugs). Despite these clear hazards, little work so far has evaluated VLM safety and the novel risks created by multimodal inputs. To address this gap, we introduce MSTS, a Multimodal Safety Test Suite for VLMs. MSTS comprises 400 test prompts across 40 fine-grained hazard categories. Each test prompt consists of a text and an image that only in combination reveal their full unsafe meaning. With MSTS, we find clear safety issues in several open VLMs. We also find some VLMs to be safe by accident, meaning that they are safe because they fail to understand even simple test prompts. We translate MSTS into ten languages, showing non-English prompts to increase the rate of unsafe model responses. We also show models to be safer when tested with text only rather than multimodal prompts. Finally, we explore the automation of VLM safety assessments, finding even the best safety classifiers to be lacking.
中文: 视觉语言模型(VLMs)存在安全隐患,可能提供有害建议,因此开发了MSTS测试套件,揭示这些模型在多模态输入和非英语提示下的安全漏洞。
English: Vision-language models (VLMs) pose significant safety risks by potentially providing harmful advice, prompting the creation of MSTS, a test suite that reveals vulnerabilities in these models, especially with multimodal inputs and non-English prompts.

Authors:Wenhan Wang, Xuan Xie, Yuheng Huang, Renzhi Wang, An Ran Chen, Lei Ma
Title: Fine-grained Testing for Autonomous Driving Software: a Study on Autoware with LLM-driven Unit Testing
Abstract:
Testing autonomous driving systems (ADS) is critical to ensuring their reliability and safety. Existing ADS testing work focuses on designing scenarios to evaluate system-level behaviors, while fine-grained testing of ADS source code has received comparatively little attention. To address this gap, we present the first study on testing, specifically unit testing, for ADS source code. Our study focuses on an industrial ADS framework, Autoware. We analyze both human-written test cases and those generated by large language models (LLMs). Our findings reveal that human-written test cases in Autoware exhibit limited test coverage, and significant challenges remain in applying LLM-generated tests for Autoware unit testing. To overcome these challenges, we propose AwTest-LLM, a novel approach to enhance test coverage and improve test case pass rates across Autoware packages.
中文: 本研究首次对自动驾驶系统源代码的单元测试进行探索,揭示了Autoware中人工编写测试覆盖率不足及大语言模型生成测试的挑战,并提出AwTest-LLM方法以提升测试覆盖率和通过率。
English: This study introduces the first investigation into unit testing for autonomous driving system (ADS) source code, revealing limited coverage in human-written tests and challenges with LLM-generated tests in Autoware, and proposes AwTest-LLM to enhance coverage and pass rates.

Authors:Nuo Chen, Quanyu Dai, Xiaoyu Dong, Piaohong Wang, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Xiao-Ming Wu
Title: Evaluating Conversational Recommender Systems via Large Language Models: A User-Centric Framework
Abstract:
Conversational recommender systems (CRSs) integrate both recommendation and dialogue tasks, making their evaluation uniquely challenging. Existing approaches primarily assess CRS performance by separately evaluating item recommendation and dialogue management using rule-based metrics. However, these methods fail to capture the real human experience, and they cannot draw direct conclusions about the system's overall performance. As conversational recommender systems become increasingly vital in e-commerce, social media, and customer support, the ability to evaluate both recommendation accuracy and dialogue management quality using a single metric, thereby authentically reflecting user experience, has become the principal challenge impeding progress in this field. In this work, we propose a user-centric evaluation framework based on large language models (LLMs) for CRSs, namely Conversational Recommendation Evaluator (CoRE). CoRE consists of two main components: (1) LLM-As-Evaluator. Firstly, we comprehensively summarize 12 key factors influencing user experience in CRSs and directly leverage LLM as an evaluator to assign a score to each factor. (2) Multi-Agent Debater. Secondly, we design a multi-agent debate framework with four distinct roles (common user, domain expert, linguist, and HCI expert) to discuss and synthesize the 12 evaluation factors into a unified overall performance score. Furthermore, we apply the proposed framework to evaluate four CRSs on two benchmark datasets. The experimental results show that CoRE aligns well with human evaluation in most of the 12 factors and the overall assessment. Especially, CoRE's overall evaluation scores demonstrate significantly better alignment with human feedback compared to existing rule-based metrics.
中文摘要:本文提出CoRE框架,利用大语言模型通过评估12个关键用户体验因素并结合多智能体辩论进行综合评分,实验证明该框架在整体评估和多数指标上比传统规则指标更贴合人类评价。
English Summary: This paper introduces CoRE, a user-centric evaluation framework using large language models to assess conversational recommender systems by scoring 12 key user experience factors and synthesizing them through multi-agent debates, demonstrating superior alignment with human evaluation compared to traditional metrics.

Authors:Yukai Ma, Tiantian Wei, Naiting Zhong, Jianbiao Mei, Tao Hu, Licheng Wen, Xuemeng Yang, Botian Shi, Yong Liu
Title: LeapVAD: A Leap in Autonomous Driving via Cognitive Perception and Dual-Process Thinking
Abstract:
While autonomous driving technology has made remarkable strides, data-driven approaches still struggle with complex scenarios due to their limited reasoning capabilities. Meanwhile, knowledge-driven autonomous driving systems have evolved considerably with the popularization of visual language models. In this paper, we propose LeapVAD, a novel method based on cognitive perception and dual-process thinking. Our approach implements a human-attentional mechanism to identify and focus on critical traffic elements that influence driving decisions. By characterizing these objects through comprehensive attributes - including appearance, motion patterns, and associated risks - LeapVAD achieves more effective environmental representation and streamlines the decision-making process. Furthermore, LeapVAD incorporates an innovative dual-process decision-making module that mimics the human driving learning process. The system consists of an Analytic Process (System-II) that accumulates driving experience through logical reasoning and a Heuristic Process (System-I) that refines this knowledge via fine-tuning and few-shot learning. LeapVAD also includes reflective mechanisms and a growing memory bank, enabling it to learn from past mistakes and continuously improve its performance in a closed-loop environment. To enhance efficiency, we develop a scene encoder network that generates compact scene representations for rapid retrieval of relevant driving experiences. Extensive evaluations conducted on two leading autonomous driving simulators, CARLA and DriveArena, demonstrate that LeapVAD achieves superior performance compared to camera-only approaches despite limited training data. Comprehensive ablation studies further emphasize its effectiveness in continuous learning and domain adaptation. Project page: https://pjlab-adg.github.io/LeapVAD/.
中文: 本文提出LeapVAD新型自动驾驶方法,通过认知感知和双过程思维机制,聚焦关键交通要素进行决策优化,并借助经验积累和反思学习实现持续改进,在有限训练数据下于仿真环境中展现出优越性能。
English: This paper introduces LeapVAD, a novel autonomous driving method that employs cognitive perception and dual-process thinking to enhance decision-making by focusing on critical traffic elements and continuously improving through experience accumulation and reflective learning, demonstrating superior performance in simulations with limited data.

Authors:Yuji Chai, Mujin Kwen, David Brooks, Gu-Yeon Wei
Title: FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
Abstract:
Deploying LLMs on edge devices presents serious technical challenges. Memory elasticity is crucial for edge devices with unified memory, where memory is shared and fluctuates dynamically. Existing solutions suffer from either poor transition granularity or high storage costs. We propose FlexQuant, a novel elasticity framework that generates an ensemble of quantized models, providing an elastic hosting solution with 15x granularity improvement and 10x storage reduction compared to SoTA methods. FlexQuant works with most quantization methods and creates a family of trade-off options under various storage limits through our pruning method. It brings great performance and flexibility to the edge deployment of LLMs.
中文摘要:FlexQuant是一种创新的弹性框架,通过量化模型集成与剪枝技术,显著提升了边缘设备上大语言模型的部署效率,实现了更优的粒度控制与存储优化。
English Summary: FlexQuant is an innovative elasticity framework that enhances LLM deployment on edge devices by offering improved granularity and reduced storage through an ensemble of quantized models and pruning techniques.

Authors:Hanjiang Hong, Kai-Kit Wong, Hao Xu, Yin Xu, Hyundong Shin, Ross Murch, Dazhi He, Wenjun Zhang
Title: Downlink OFDM-FAMA in 5G-NR Systems
Abstract:
Fluid antenna multiple access (FAMA), enabled by the fluid antenna system (FAS), offers a new and straightforward solution to massive connectivity. Previous results on FAMA were primarily based on narrowband channels. This paper studies the adoption of FAMA within the fifth-generation (5G) orthogonal frequency division multiplexing (OFDM) framework, referred to as OFDM-FAMA, and evaluates its performance in broadband multipath channels. We first design the OFDM-FAMA system, taking into account 5G channel coding and OFDM modulation. Then the system's achievable rate is analyzed, and an algorithm to approximate the FAS configuration at each user is proposed based on the rate. Extensive link-level simulation results reveal that OFDM-FAMA can significantly improve the multiplexing gain over the OFDM system with fixed-position antenna (FPA) users, especially when robust channel coding is applied and the number of radio-frequency (RF) chains at each user is small.
中文: OFDM-FAMA将流体天线多址接入技术融入5G正交频分复用系统,通过优化天线配置和强健信道编码,在宽带多径信道中显著提升了复用增益。
English: OFDM-FAMA, integrating fluid antenna multiple access into 5G OFDM systems, enhances multiplexing gain in broadband channels through optimized antenna configuration and robust channel coding.

Authors:Zhenyu Lei, Yushun Dong, Weiyu Li, Rong Ding, Qi Wang, Jundong Li
Title: Harnessing Large Language Models for Disaster Management: A Survey
Abstract:
Large language models (LLMs) have revolutionized scientific research with their exceptional capabilities and transformed various fields. Among their practical applications, LLMs have been playing a crucial role in mitigating threats to human life, infrastructure, and the environment. Despite growing research in disaster LLMs, there remains a lack of systematic review and in-depth analysis of LLMs for natural disaster management. To address the gap, this paper presents a comprehensive survey of existing LLMs in natural disaster management, along with a taxonomy that categorizes existing works based on disaster phases and application scenarios. By collecting public datasets and identifying key challenges and opportunities, this study aims to guide the professional community in developing advanced LLMs for disaster management to enhance the resilience against natural disasters.
Chinese: 本文对应用于自然灾害管理的大语言模型进行了系统综述与分类,通过分析现有成果、数据集及关键挑战,旨在指导开发更先进的灾害管理大语言模型以增强灾害应对能力。
English: This paper provides a systematic survey and taxonomy of large language models (LLMs) applied to natural disaster management, aiming to guide the development of advanced disaster LLMs by analyzing existing works, datasets, and key challenges.

Authors:Qing Wang, Jixun Yao, Zhaokai Sun, Pengcheng Guo, Lei Xie, John H. L. Hansen
Title: DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification
Abstract:
Being a form of biometric identification, the security of the speaker identification (SID) system is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are challenging for both humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach that exploits the capability of a diffusion-based voice conversion (DiffVC) model to generate adversarial fake audio with distinct target speaker attribution. By introducing adversarial constraints into the generative process of the diffusion-based voice conversion model, we craft fake samples that effectively mislead target models while preserving speaker-wise characteristics. Specifically, inspired by the use of randomly sampled Gaussian noise in conventional adversarial attacks and diffusion processes, we incorporate adversarial constraints into the reverse diffusion process. These constraints subtly guide the reverse diffusion process toward aligning with the target speaker distribution. Our experiments on the LibriTTS dataset indicate that DiffAttack significantly improves the attack success rate compared to vanilla DiffVC and other methods. Moreover, objective and subjective evaluations demonstrate that introducing adversarial constraints does not compromise the speech quality generated by the DiffVC model.
中文摘要:本研究提出DiffAttack,一种基于扩散语音转换的新型对抗攻击方法,能生成保留说话人音色的欺骗性音频,在有效误导说话人识别系统的同时不损害语音质量,显著提高了攻击成功率。
English Summary: This study introduces DiffAttack, a novel adversarial attack method using diffusion-based voice conversion to create deceptive audio that preserves speaker timbre while effectively misleading speaker identification systems, achieving higher success rates without compromising speech quality.

Authors:Hyeonsoo Jo, Hyunjin Hwang, Fanchen Bu, Soo Yong Lee, Chanyoung Park, Kijung Shin
Title: On Measuring Unnoticeability of Graph Adversarial Attacks: Observations, New Measure, and Applications
Abstract:
Adversarial attacks are allegedly unnoticeable. Prior studies have designed attack noticeability measures on graphs, primarily using statistical tests to compare the topology of original and (possibly) attacked graphs. However, we observe two critical limitations in the existing measures. First, because the measures rely on simple rules, attackers can readily enhance their attacks to bypass them, reducing their attack "noticeability" and, yet, maintaining their attack performance. Second, because the measures naively leverage global statistics, such as degree distributions, they may entirely overlook attacks until severe perturbations occur, letting the attacks be almost "totally unnoticeable." To address the limitations, we introduce HideNSeek, a learnable measure for graph attack noticeability. First, to mitigate the bypass problem, HideNSeek learns to distinguish the original and (potential) attack edges using a learnable edge scorer (LEO), which scores each edge on its likelihood of being an attack. Second, to mitigate the overlooking problem, HideNSeek conducts imbalance-aware aggregation of all the edge scores to obtain the final noticeability score. Using six real-world graphs, we empirically demonstrate that HideNSeek effectively alleviates the observed limitations, and LEO (i.e., our learnable edge scorer) outperforms eleven competitors in distinguishing attack edges under five different attack methods. For an additional application, we show that LEO boosts the performance of robust GNNs by removing attack-like edges.
Chinese Summary: 现有图对抗攻击检测方法因依赖简单规则和全局统计而存在易被绕过和忽略细微攻击的局限,为此提出的HideNSeek通过可学习边评分器和不平衡感知聚合,有效识别攻击边并提升检测性能。
English Summary: Existing measures for detecting adversarial attacks on graphs are limited by their simplicity and reliance on global statistics, making them easily bypassed and slow to detect subtle attacks, leading to the introduction of HideNSeek, a learnable measure that effectively identifies attack edges and improves detection through imbalance-aware scoring.

Authors:Anna Ivagnes, Maria Strazzullo, Michele Girfoglio, Traian Iliescu, Gianluigi Rozza
Title: Data-driven Optimization for the Evolve-Filter-Relax regularization of convection-dominated flows
Abstract:
Numerical stabilization techniques are often employed in under-resolved simulations of convection-dominated flows to improve accuracy and mitigate spurious oscillations. Specifically, the evolve--filter--relax (EFR) algorithm is a framework which consists in evolving the solution, applying a filtering step to remove high-frequency noise, and relaxing through a convex combination of filtered and original solutions. The stability and accuracy of the EFR solution strongly depend on two parameters, the filter radius $δ$ and the relaxation parameter $χ$. Standard choices for these parameters are usually fixed in time, and related to the full order model setting, i.e., the grid size for $δ$ and the time step for $χ$. The key novelties with respect to the standard EFR approach are: (i) time-dependent parameters $δ(t)$ and $χ(t)$, and (ii) data-driven adaptive optimization of the parameters in time, considering a fully-resolved simulation as reference. In particular, we propose three different classes of optimized-EFR (Opt-EFR) strategies, aiming to optimize one or both parameters. The new Opt-EFR strategies are tested in the under-resolved simulation of a turbulent flow past a cylinder at $Re=1000$. The Opt-EFR proved to be more accurate than standard approaches by up to 99$\%$, while maintaining a similar computational time. In particular, the key new finding of our analysis is that such accuracy can be obtained only if the optimized objective function includes: (i) a global metric (as the kinetic energy), and (ii) spatial gradients' information.
中文: 本研究在演化-滤波-松弛算法中引入了时变参数和数据驱动的自适应优化方法,通过湍流圆柱绕流算例验证,优化后的策略在保持计算效率的同时,将数值模拟精度提升高达99%,并揭示包含动能和空间梯度信息的优化目标函数是实现高精度的关键。
English: The study introduces time-dependent and data-driven adaptive optimization of parameters in the evolve-filter-relax algorithm, demonstrating that optimized strategies significantly enhance accuracy by up to 99% in under-resolved turbulent flow simulations while maintaining computational efficiency.
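The evolve, filter, relax sequence described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: `efr_step`, `box_filter`, and the periodic three-point average are illustrative stand-ins, not the authors' solver or filter. In an actual EFR scheme the filter would depend on the radius δ; the sketch only shows how the relaxation parameter χ blends the evolved and filtered fields through a convex combination.

```python
from typing import Callable, List

Field = List[float]


def box_filter(w: Field) -> Field:
    """Toy periodic three-point average (assumed stand-in for the real filter)."""
    n = len(w)
    return [(w[(i - 1) % n] + w[i] + w[(i + 1) % n]) / 3.0 for i in range(n)]


def efr_step(
    u: Field,
    evolve: Callable[[Field], Field],
    smooth: Callable[[Field], Field],
    chi: float,
) -> Field:
    """One evolve-filter-relax step on a discrete field u."""
    if not 0.0 <= chi <= 1.0:
        raise ValueError("chi must lie in [0, 1] for a convex combination")
    w = evolve(u)        # evolve: advance the unfiltered solution
    w_bar = smooth(w)    # filter: damp high-frequency content
    # relax: convex combination of the evolved and filtered fields
    return [(1.0 - chi) * wi + chi * wbi for wi, wbi in zip(w, w_bar)]
```

With `chi = 0` the step returns the evolved field unchanged, and with `chi = 1` it returns the fully filtered field; a time-dependent χ(t) would simply pass a different `chi` at each step.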

Authors:Haotian Li, Lu Ying, Leixian Shen, Yun Wang, Yingcai Wu, Huamin Qu
Title: Composing Data Stories with Meta Relations
Abstract:
To facilitate the creation of compelling and engaging data stories, AI-powered tools have been introduced to automate the three stages in the workflow: analyzing data, organizing findings, and creating visuals. However, these tools rely on data-level information to derive inflexible relations between findings. Therefore, they often create one-size-fits-all data stories. Differently, our formative study reveals that humans heavily rely on meta relations between these findings from diverse domain knowledge and narrative intent, going beyond datasets, to compose their findings into stylized data stories. Such a gap indicates the importance of introducing meta relations to elevate AI-created stories to a satisfactory level. Though necessary, it is still unclear where and how AI should be involved in working with humans on meta relations. To answer the question, we conducted an exploratory user study with Remex, an AI-powered data storytelling tool that suggests meta relations in the analysis stage and applies meta relations for data story organization. The user study reveals various findings about introducing AI for meta relations into the storytelling workflow, such as the benefit of considering meta relations and their diverse expected usage scenarios. Finally, the paper concludes with lessons and suggestions about applying meta relations to compose data stories to hopefully inspire future research.
中文摘要:当前AI工具仅依赖数据层面的关系生成通用数据故事,而人类讲故事者则利用领域知识和叙事意图中的元关系来创作风格化故事,这表明AI需引入元关系以提升数据故事的质量。
English Summary: AI tools currently create generic data stories by relying solely on data-level relations, but human storytellers use meta relations from domain knowledge and narrative intent to craft stylized stories, highlighting the need for AI to incorporate these meta relations for improved storytelling.

Authors:Zhenglai Li, Jun Wang, Chang Tang, Xinzhong Zhu, Wei Zhang, Xinwang Liu
Title: Balanced Multi-view Clustering
Abstract:
Multi-view clustering (MvC) aims to integrate information from different views to enhance the capability of the model in capturing the underlying data structures. The widely used joint training paradigm in MvC potentially fails to fully leverage the multi-view information, because the uniform learning objective for all views leaves view-specific features imbalanced and under-optimized. For instance, particular views with more discriminative information could dominate the learning process in the joint training paradigm, leading to other views being under-optimized. To alleviate this issue, we first analyze the imbalanced phenomenon in the joint-training paradigm of multi-view clustering from the perspective of gradient descent for each view-specific feature extractor. Then, we propose a novel balanced multi-view clustering (BMvC) method, which introduces a view-specific contrastive regularization (VCR) to modulate the optimization of each view. Concretely, VCR preserves the sample similarities captured from the joint features and view-specific ones into the clustering distributions corresponding to view-specific features to enhance the learning process of view-specific feature extractors. Additionally, a theoretical analysis is provided to illustrate that VCR adaptively modulates the magnitudes of gradients for updating the parameters of view-specific feature extractors to achieve a balanced multi-view learning procedure. In such a manner, BMvC achieves a better trade-off between the exploitation of view-specific patterns and the exploration of view-invariance patterns to fully learn the multi-view information for the clustering task. Finally, a set of experiments is conducted to verify the superiority of the proposed method compared with state-of-the-art approaches on eight benchmark MvC datasets.
中文: 提出的平衡多视图聚类方法通过引入视图特定对比正则化来调节梯度幅度,解决了联合训练中的不平衡问题,在基准数据集上实现了优越性能。
English: The proposed Balanced Multi-view Clustering (BMvC) method addresses the imbalance in joint training by introducing a view-specific contrastive regularization to modulate gradient magnitudes, achieving superior performance on benchmark datasets.

Authors:Minoo Dolatabadi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Title: Neural Error Covariance Estimation for Precise LiDAR Localization
Abstract:
Autonomous vehicles have gained significant attention due to technological advancements and their potential to transform transportation. A critical challenge in this domain is precise localization, particularly in LiDAR-based map matching, which is prone to errors due to degeneracy in the data. Most sensor fusion techniques, such as the Kalman filter, rely on accurate error covariance estimates for each sensor to improve localization accuracy. However, obtaining reliable covariance values for map matching remains a complex task. To address this challenge, we propose a neural network-based framework for predicting localization error covariance in LiDAR map matching. To achieve this, we introduce a novel dataset generation method specifically designed for error covariance estimation. In our evaluation using a Kalman filter, we achieved a 2 cm improvement in localization accuracy, a significant enhancement in this domain.
Chinese: 该研究提出了一种基于神经网络的框架,用于预测激光雷达地图匹配中的定位误差协方差,并通过创新的数据集生成方法,在卡尔曼滤波器评估中将定位精度提高了2厘米。
English: The study introduces a neural network framework to predict localization error covariance in LiDAR map matching, using a novel dataset generation method that improved localization accuracy by 2 cm in Kalman filter evaluations.
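To see why the abstract stresses accurate error covariance, consider the textbook scalar Kalman measurement update sketched below. This is not the authors' network or pipeline; `kalman_update` and its argument names are hypothetical, and the measurement variance `r` stands in for whatever covariance a predictor would supply for the map-matching result.

```python
from typing import Tuple


def kalman_update(
    x_pred: float, p_pred: float, z: float, r: float
) -> Tuple[float, float, float]:
    """Scalar Kalman measurement update.

    x_pred, p_pred: predicted state and its variance
    z, r:           measurement and its (assumed known) error variance
    Returns the updated state, updated variance, and the Kalman gain.
    """
    if p_pred < 0.0 or r <= 0.0:
        raise ValueError("variances must be non-negative, and r must be positive")
    gain = p_pred / (p_pred + r)            # trust in the measurement
    x_new = x_pred + gain * (z - x_pred)    # pull the state toward the measurement
    p_new = (1.0 - gain) * p_pred           # uncertainty shrinks after the update
    return x_new, p_new, gain
```

A larger `r` lowers the gain, so an unreliable map-matching result moves the fused estimate less; an over-confident `r` would do the opposite, which is why a mis-estimated covariance can hurt localization.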

Authors:Xiyuan Gao, Bing Cao, Pengfei Zhu, Nannan Wang, Qinghua Hu
Title: Asymmetric Reinforcing against Multi-modal Representation Bias
Abstract:
The strength of multimodal learning lies in its ability to integrate information from various sources, providing rich and comprehensive insights. However, in real-world scenarios, multi-modal systems often face the challenge of dynamic modality contributions: the dominance of different modalities may change with the environment, leading to suboptimal performance in multimodal learning. Current methods mainly enhance weak modalities to balance multimodal representation bias, which inevitably optimizes from a partial-modality perspective and can easily degrade the performance of dominant modalities. To address this problem, we propose an Asymmetric Reinforcing method against Multimodal representation bias (ARM). Our ARM dynamically reinforces the weak modalities while maintaining the ability to represent dominant modalities through conditional mutual information. Moreover, we provide an in-depth analysis showing that optimizing certain modalities could cause information loss and prevent leveraging the full advantages of multimodal data. By exploring the dominance and narrowing the contribution gaps between modalities, we have significantly improved the performance of multimodal learning, making notable progress in mitigating imbalanced multimodal learning.
中文: 提出的非对称增强方法(ARM)通过条件互信息动态强化弱势模态并保持强势模态能力,通过缩小模态间贡献差距有效解决了多模态学习不平衡问题。
English: The proposed Asymmetric Reinforcing method (ARM) dynamically strengthens weak modalities while preserving dominant ones through conditional mutual information, effectively addressing imbalanced multimodal learning by narrowing contribution gaps between modalities.

Authors:Tadahiro Taniguchi, Ryo Ueda, Tomoaki Nakamura, Masahiro Suzuki, Akira Taniguchi
Title: Generative Emergent Communication: Large Language Model is a Collective World Model
Abstract:
Large Language Models (LLMs) have demonstrated a remarkable ability to capture extensive world knowledge, yet how this is achieved without direct sensorimotor experience remains a fundamental puzzle. This study proposes a novel theoretical solution by introducing the Collective World Model hypothesis. We argue that an LLM does not learn a world model from scratch; instead, it learns a statistical approximation of a collective world model that is already implicitly encoded in human language through a society-wide process of embodied, interactive sense-making. To formalize this process, we introduce generative emergent communication (Generative EmCom), a framework built on the Collective Predictive Coding (CPC). This framework models the emergence of language as a process of decentralized Bayesian inference over the internal states of multiple agents. We argue that this process effectively creates an encoder-decoder structure at a societal scale: human society collectively encodes its grounded, internal representations into language, and an LLM subsequently decodes these symbols to reconstruct a latent space that mirrors the structure of the original collective representations. This perspective provides a principled, mathematical explanation for how LLMs acquire their capabilities. The main contributions of this paper are: 1) the formalization of the Generative EmCom framework, clarifying its connection to world models and multi-agent reinforcement learning, and 2) its application to interpret LLMs, explaining phenomena such as distributional semantics as a natural consequence of representation reconstruction. This work provides a unified theory that bridges individual cognitive development, collective language evolution, and the foundations of large-scale AI.
中文摘要:本研究提出集体世界模型假说,认为大语言模型并非从零学习世界模型,而是通过生成式涌现通信框架,对人类语言中已编码的社会性感知所形成的集体世界模型进行统计近似重建。
English Summary: This study proposes the Collective World Model hypothesis, suggesting that LLMs learn a statistical approximation of a world model already embedded in human language through societal sense-making, formalized via a generative emergent communication framework.

Authors:Wonjun Lee, Solee Im, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Title: DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition
Abstract:
Dysarthric speech recognition often suffers from performance degradation due to the intrinsic diversity of dysarthric severity and extrinsic disparity from normal speech. To bridge these gaps, we propose a Dynamic Phoneme-level Contrastive Learning (DyPCL) method, which yields invariant representations across diverse speakers. We decompose the speech utterance into phoneme segments for phoneme-level contrastive learning, leveraging dynamic connectionist temporal classification alignment. Unlike prior studies focusing on utterance-level embeddings, our granular learning allows discrimination of subtle parts of speech. In addition, we introduce dynamic curriculum learning, which progressively transitions from easy negative samples to difficult-to-distinguish negative samples based on the phonetic similarity of phonemes. Our approach to training by difficulty levels alleviates the inherent variability of speakers, better identifying challenging speech. Evaluated on the UASpeech dataset, DyPCL outperforms baseline models, achieving an average 22.10\% relative reduction in word error rate (WER) across the overall dysarthria group.
中文摘要:提出的动态音素级对比学习方法通过细粒度音素分析和动态课程学习,有效克服了构音障碍语音识别的多样性挑战,在UASpeech数据集上实现了22.10%的词错误率相对降低。
English Summary: The proposed Dynamic Phoneme-level Contrastive Learning (DyPCL) method addresses dysarthric speech recognition challenges by creating invariant representations through granular phoneme analysis and dynamic curriculum learning, achieving a 22.10% relative WER reduction on the UASpeech dataset.
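For readers unfamiliar with the metric, a relative WER reduction is the drop in WER divided by the baseline WER, not the difference in percentage points. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical and no figures from the paper are reproduced.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer.

    Both arguments must use the same unit (both fractions or both percent).
    """
    if baseline_wer <= 0.0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As a purely illustrative case with made-up numbers, moving from a WER of 40 to 30 is a 10-point absolute drop but a 25% relative reduction.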

Authors:Zhe Wang, Xiaoliang Huo, Siqi Fan, Jingjing Liu, Ya-Qin Zhang, Yan Wang
Title: IROAM: Improving Roadside Monocular 3D Object Detection Learning from Autonomous Vehicle Data Domain
Abstract:
In autonomous driving, the perception capabilities of the ego-vehicle can be improved with roadside sensors, which can provide a holistic view of the environment. However, existing monocular detection methods designed for vehicle cameras are not suitable for roadside cameras due to viewpoint domain gaps. To bridge this gap and Improve ROAdside Monocular 3D object detection, we propose IROAM, a semantic-geometry decoupled contrastive learning framework, which takes vehicle-side and roadside data as input simultaneously. IROAM has two significant modules. The In-Domain Query Interaction module utilizes a transformer to learn content and depth information for each domain and outputs object queries. To learn better feature representations from the two domains, the Cross-Domain Query Enhancement module decouples queries into semantic and geometry parts, and only the former is used for contrastive learning. Experiments demonstrate the effectiveness of IROAM in improving the roadside detector's performance. The results validate that IROAM can learn cross-domain information.
中文摘要:IROAM通过语义-几何解耦的对比学习框架,有效弥合车载与路侧摄像头的视角差异,提升了路侧单目三维物体检测的性能。
English Summary: IROAM enhances roadside monocular 3D object detection by decoupling semantic and geometric features through contrastive learning, effectively bridging the viewpoint gap between vehicle and roadside cameras.

Authors:Aleksandar Petrov, Shruti Agarwal, Philip H. S. Torr, Adel Bibi, John Collomosse
Title: On the Coexistence and Ensembling of Watermarks
Abstract:
Watermarking, the practice of embedding imperceptible information into media such as images, videos, audio, and text, is essential for intellectual property protection, content provenance and attribution. The growing complexity of digital ecosystems necessitates watermarks for different uses to be embedded in the same media. However, to detect and decode all watermarks, they need to coexist well with one another. We perform the first study of coexistence of deep image watermarking methods and, contrary to intuition, we find that various open-source watermarks can coexist with only minor impacts on image quality and decoding robustness. The coexistence of watermarks also opens the avenue for ensembling watermarking methods. We show how ensembling can increase the overall message capacity and enable new trade-offs between capacity, accuracy, robustness and image quality, without needing to retrain the base models.
中文: 深度图像水印能够共存,对图像质量和鲁棒性影响微小,且通过集成方法无需重新训练模型即可提升容量并优化性能权衡。
English: Deep image watermarks can coexist with minimal impact on quality and robustness, enabling ensemble methods that enhance capacity and trade-offs without retraining models.

Authors:Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
Title: Optimizing Large Language Model Training Using FP4 Quantization
Abstract:
The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
中文摘要:本研究首次提出针对大语言模型的FP4训练框架,通过可微分量化估计器和异常值处理策略,在实现与高精度相当准确度的同时,为超低精度高效训练奠定基础。
English Summary: This study introduces the first FP4 training framework for large language models, utilizing a differentiable quantization estimator and outlier management to achieve accuracy comparable to higher precisions while enabling efficient ultra-low-precision training.
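The quantization error that the abstract refers to can be made concrete with a small sketch of vector-wise 4-bit float quantization. The assumptions are stated loudly: the E2M1 value grid, the absmax scaling rule, and the name `quantize_vector_fp4` are illustrative choices, and the abstract does not say which FP4 format or scaling the authors use. The sketch covers only round-to-nearest in the forward direction; the paper's differentiable estimator and outlier clamping are not reproduced.

```python
from typing import List

# Magnitudes representable by an E2M1 4-bit float
# (1 sign bit, 2 exponent bits, 1 mantissa bit). Assumed format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_vector_fp4(values: List[float]) -> List[float]:
    """Quantize one vector to the E2M1 grid with a shared absmax scale.

    Returns the dequantized values so the rounding error is visible.
    """
    absmax = max((abs(v) for v in values), default=0.0)
    if absmax == 0.0:
        return [0.0 for _ in values]
    scale = absmax / E2M1_GRID[-1]  # map the largest magnitude to 6.0
    out = []
    for v in values:
        scaled = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - scaled))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out
```

With only eight magnitudes available, small entries collapse to zero whenever one large outlier sets the scale, which illustrates why outlier handling matters at this precision.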

Authors:Koji Inoue, Divesh Lala, Mikey Elmers, Keiko Ochi, Tatsuya Kawahara
Title: An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
Abstract:
Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task's complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.
中文: 本文通过构建多模态三方对话语料库研究多方对话中的受话人识别任务,发现GPT-4o模型表现仅略优于随机猜测,凸显了提升语言模型理解复杂对话动态能力的迫切需求。
English: This paper introduces a multimodal triadic dialogue corpus to address addressee recognition in multi-party conversations, revealing GPT-4o's near-chance performance and emphasizing the need for improved language model capabilities in handling complex dialogue dynamics.

Authors:Koji Inoue, Mikey Elmers, Divesh Lala, Tatsuya Kawahara
Title: Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation
Abstract:
Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including "Empathy and Affinity" and "Humor and Surprise," highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4o's performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.
Chinese: 本研究通过构建日语对话中可引发笑声情境的分类体系,利用GPT-4o实现了43.14%的F1值,为提升AI对笑声的识别与生成能力、促进更自然的人机交互奠定了基础。
English: This study develops a taxonomy for laughable contexts in Japanese conversations to enhance AI's nuanced recognition and generation of laughter, achieving a 43.14% F1 score with GPT-4o and advancing natural human-AI interaction.

Authors:Keane Ong, Rui Mao, Frank Xing, Ranjan Satapathy, Johan Sulaeman, Erik Cambria, Gianmarco Mengaldo
Title: ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis
Abstract:
Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e., sustainability disclosures), and not least by the limited effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges (immateriality, complexity, and subjectivity) that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets, such as ('halve carbon emission', supports, 'emissions control'), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, captures relevant and actionable sustainability information from sustainability disclosures more effectively than state-of-the-art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31%, respectively. These metrics describe the extent to which topic terms are related to ESG and depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, a key advantage in its simplicity for non-technical stakeholders.
Chinese: 本文提出的ESGSenticNet神经符号知识库,通过无需训练即可优于现有方法捕捉ESG相关洞察,有效解决了企业可持续发展数据分析中的核心难题。
English: This paper introduces ESGSenticNet, a neurosymbolic knowledge base that effectively addresses challenges in analyzing corporate sustainability data by outperforming existing methods in capturing ESG-related insights without requiring training.

Authors:Keane Ong, Rui Mao, Deeksha Varshney, Frank Xing, Ranjan Satapathy, Johan Sulaeman, Erik Cambria, Gianmarco Mengaldo
Title: ESGSenticNet: A Neurosymbolic Knowledge Base for Corporate Sustainability Analysis
Abstract:
Evaluating corporate sustainability performance is essential to drive sustainable business practices, amid the need for a more sustainable economy. However, this is hindered by the complexity and volume of corporate sustainability data (i.e., sustainability disclosures), and not least by the limited effectiveness of the NLP tools used to analyse them. To this end, we identify three primary challenges (immateriality, complexity, and subjectivity) that exacerbate the difficulty of extracting insights from sustainability disclosures. To address these issues, we introduce ESGSenticNet, a publicly available knowledge base for sustainability analysis. ESGSenticNet is constructed from a neurosymbolic framework that integrates specialised concept parsing, GPT-4o inference, and semi-supervised label propagation, together with a hierarchical taxonomy. This approach culminates in a structured knowledge base of 44k knowledge triplets, such as ('halve carbon emission', supports, 'emissions control'), for effective sustainability analysis. Experiments indicate that ESGSenticNet, when deployed as a lexical method, captures relevant and actionable sustainability information from sustainability disclosures more effectively than state-of-the-art baselines. Besides capturing a high number of unique ESG topic terms, ESGSenticNet outperforms baselines on the ESG relatedness and ESG action orientation of these terms by 26% and 31%, respectively. These metrics describe the extent to which topic terms are related to ESG and depict an action toward ESG. Moreover, when deployed as a lexical method, ESGSenticNet does not require any training, a key advantage in its simplicity for non-technical stakeholders.
Chinese: 本文提出的ESGSenticNet神经符号知识库,通过无需训练即可优于现有方法捕捉ESG相关洞察,有效解决了企业可持续发展数据分析中的核心难题。
English: This paper introduces ESGSenticNet, a neurosymbolic knowledge base that effectively addresses challenges in analyzing corporate sustainability data by outperforming existing methods in capturing ESG-related insights without requiring training.

Authors:Kaiyuan Zhang, Siyuan Cheng, Guangyu Shen, Bruno Ribeiro, Shengwei An, Pin-Yu Chen, Xiangyu Zhang, Ninghui Li
Title: CENSOR: Defense Against Gradient Inversion via Orthogonal Subspace Bayesian Sampling
Abstract:
Federated learning collaboratively trains a neural network on a global server, where each local client receives the current global model weights and sends back parameter updates (gradients) based on its local private data. The process of sending these model updates may leak clients' private data. Existing gradient inversion attacks can exploit this vulnerability to recover private training instances from a client's gradient vectors. Recently, researchers have proposed advanced gradient inversion techniques that existing defenses struggle to handle effectively. In this work, we present a novel defense tailored for large neural network models. Our defense capitalizes on the high dimensionality of the model parameters to perturb gradients within a subspace orthogonal to the original gradient. By leveraging cold posteriors over orthogonal subspaces, our defense implements a refined gradient update mechanism. This enables the selection of an optimal gradient that not only safeguards against gradient inversion attacks but also maintains model utility. We conduct comprehensive experiments across three different datasets and evaluate our defense against various state-of-the-art attacks and defenses. Code is available at https://censor-gradient.github.io.
Chinese: 本文提出了一种新的联邦学习防御机制,通过在正交子空间扰动梯度来抵御梯度反演攻击,同时保持模型实用性,并通过针对多种先进攻击的广泛实验验证了其有效性。
English: This paper introduces a novel defense mechanism for federated learning that perturbs gradients within an orthogonal subspace to protect against gradient inversion attacks while preserving model utility, validated through extensive experiments against state-of-the-art attacks.
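
The core geometric step described in the abstract, perturbing a gradient inside the subspace orthogonal to it, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the Gaussian noise source, and the `scale` parameter are assumptions made for illustration, and the Bayesian (cold posterior) selection among candidate gradients is omitted entirely.

```python
import math
import random


def orthogonal_perturbation(grad, scale, rng):
    """Return a random vector of norm `scale` that is orthogonal to `grad`.

    Hypothetical helper: assumes len(grad) >= 2 and a non-zero gradient.
    """
    # Draw an isotropic Gaussian direction.
    noise = [rng.gauss(0.0, 1.0) for _ in grad]
    g_sq = sum(g * g for g in grad)
    if g_sq == 0.0:
        raise ValueError("gradient must be non-zero")
    # Subtract the projection of the noise onto the gradient,
    # leaving only the component orthogonal to it.
    coeff = sum(n * g for n, g in zip(noise, grad)) / g_sq
    ortho = [n - coeff * g for n, g in zip(noise, grad)]
    # Rescale the orthogonal component to the requested norm.
    norm = math.sqrt(sum(o * o for o in ortho))
    return [scale * o / norm for o in ortho]


def perturb_gradient(grad, scale=1.0, seed=0):
    """Add an orthogonal perturbation to `grad` (illustrative only)."""
    rng = random.Random(seed)
    delta = orthogonal_perturbation(grad, scale, rng)
    return [g + d for g, d in zip(grad, delta)]


if __name__ == "__main__":
    g = [1.0, 2.0, -3.0, 0.5]
    print(perturb_gradient(g, scale=0.1))
```

Because the perturbation is orthogonal to the original gradient, the projection of the perturbed gradient onto the original direction is unchanged. This is one intuition for why such a perturbation can hide information while still allowing a useful update.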

Authors:Shuhe Wang, Xiaoya Li, Xiaofei Sun, Guoyin Wang, Tianwei Zhang, Jiwei Li, Eduard Hovy
Title: Turn That Frown Upside Down: FaceID Customization via Cross-Training Data
Abstract:
Existing face identity (FaceID) customization methods perform well but are limited to generating faces identical to the input, while in real-world applications, users often desire images of the same person but with variations, such as different expressions (e.g., smiling, angry) or angles (e.g., side profile). This limitation arises from the lack of datasets with controlled input-output facial variations, restricting models' ability to learn effective modifications. To address this issue, we propose CrossFaceID, the first large-scale, high-quality, and publicly available dataset specifically designed to improve the facial modification capabilities of FaceID customization models. Specifically, CrossFaceID consists of 40,000 text-image pairs from approximately 2,000 persons, with each person represented by around 20 images showcasing diverse facial attributes such as poses, expressions, angles, and adornments. During the training stage, a specific face of a person is used as input, and the FaceID customization model is forced to generate another image of the same person but with altered facial features. This allows the FaceID customization model to acquire the ability to personalize and modify known facial features during the inference stage. Experiments show that models fine-tuned on the CrossFaceID dataset retain their performance in preserving FaceID fidelity while significantly improving their face customization capabilities. To facilitate further advancements in the FaceID customization field, our code, constructed datasets, and trained models are fully available to the public.
Chinese: 现有人脸身份定制方法虽能精准复制面部,但无法生成如表情或角度等变化;新提出的CrossFaceID数据集解决了这一问题,在保持身份保真度的同时显著提升了面部定制能力。
English: Current FaceID customization methods excel at replicating faces but lack the ability to generate variations like expressions or angles, which is addressed by the new CrossFaceID dataset that enhances models' modification capabilities while maintaining identity fidelity.

Authors:Kerui Chen, Zhiliang Wu, Wenjin Hou, Kun Li, Hehe Fan, Yi Yang
Title: Prompt-Aware Controllable Shadow Removal
Abstract:
Shadow removal aims to restore the image content in shadowed regions. While deep learning-based methods have shown promising results, they still face key challenges: 1) uncontrolled removal of all shadows, or 2) controllable removal that relies heavily on precise shadow region masks. To address these issues, we introduce a novel paradigm: prompt-aware controllable shadow removal. Unlike existing approaches, our paradigm allows for targeted shadow removal from specific subjects based on user prompts (e.g., dots, lines, or subject masks). This approach eliminates the need for shadow annotations and offers flexible, user-controlled shadow removal. Specifically, we propose an end-to-end learnable model, the Prompt-Aware Controllable Shadow Removal Network (PACSRNet). PACSRNet consists of two key modules: a prompt-aware module that generates shadow masks for the specified subject based on the user prompt, and a shadow removal module that uses the shadow prior from the first module to restore the content in the shadowed regions. Additionally, we enhance the shadow removal module by incorporating feature information from the prompt-aware module through a linear operation, providing prompt-guided support for shadow removal. Recognizing that existing shadow removal datasets lack diverse user prompts, we contribute a new dataset specifically designed for prompt-based controllable shadow removal. Extensive experimental results demonstrate the effectiveness and superiority of PACSRNet.
中文摘要:本文提出PACSRNet,一种基于提示感知的可控阴影去除网络,通过用户提示实现对特定目标的定向阴影消除,无需阴影标注即可提供灵活的用户控制。
English Summary: This paper introduces PACSRNet, a prompt-aware controllable shadow removal network that enables targeted shadow elimination from specific subjects using user prompts, eliminating the need for shadow annotations while offering flexible user control.

Authors:Alireza Salemi, Julian Killingback, Hamed Zamani
Title: ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation
Abstract:
Evaluating personalized text generated by large language models (LLMs) is challenging, as only the LLM user, i.e., prompt author, can reliably assess the output, but re-engaging the same individuals across studies is infeasible. This paper addresses the challenge of evaluating personalized text generation by introducing ExPerT, an explainable reference-based evaluation framework. ExPerT leverages an LLM to extract atomic aspects and their evidence from the generated and reference texts, match the aspects, and evaluate their alignment based on content and writing style -- two key attributes in personalized text generation. Additionally, ExPerT generates detailed, fine-grained explanations for every step of the evaluation process, enhancing transparency and interpretability. Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments compared to the state-of-the-art text generation evaluation methods. Furthermore, human evaluators rated the usability of ExPerT's explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.
中文: 本文提出可解释的ExPerT框架,通过分析生成文本与参考文本在内容和风格上的对齐来评估个性化文本生成,其与人类判断的相关性提升7.2%,且解释能力获得4.7/5的高分评级。
English: This paper introduces ExPerT, an explainable framework that evaluates personalized text generation by analyzing content and style alignment between generated and reference texts, achieving a 7.2% improvement in human judgment correlation and high interpretability ratings.

Authors:Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, Nadine Chang, Karan Sapra, Amala Sanjay Deshmukh, Tuomas Rintamaki, Matthieu Le, Ilia Karmanov, Lukas Voegtle, Philipp Fischer, De-An Huang, Timo Roman, Tong Lu, Jose M. Alvarez, Bryan Catanzaro, Jan Kautz, Andrew Tao, Guilin Liu, Zhiding Yu
Title: Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Abstract:
Recently, promising progress has been made by open-source vision-language models (VLMs) in bringing their capabilities closer to those of proprietary frontier models. However, most open-source models only publish their final model weights, leaving the critical details of data strategies and implementation largely opaque. In this work, we address VLM post-training from a data-centric perspective, showing the key role of data strategy in developing frontier VLMs. By studying and building our post-training data strategy from scratch, we share detailed insights into the development processes, aiming to benefit the development of competitive models for the open-source community. Our introduced data strategy, together with training recipes and model design, leads to a family of performant VLMs named Eagle2. Specifically, Eagle2-9B achieves state-of-the-art results across various multimodal benchmarks, matching certain competitive models with up to 70B parameters.
中文: 本研究从数据中心的视角探讨视觉语言模型的后训练,通过分享详细的数据策略和训练方案开发出性能领先的Eagle2模型系列,旨在推动开源社区发展具有竞争力的模型。
English: This work introduces a data-centric approach to vision-language model post-training, developing the Eagle2 family of models that achieve state-of-the-art performance by sharing detailed data strategies and training insights for the open-source community.

Authors:Alexan Ayrapetyan, Sofia Kostandian, Ara Yeroyan, Mher Yerznkanyan, Nikolay Karpov, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg
Title: Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages
Abstract:
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, and advanced data preprocessing, together with various permissive data sources such as audiobooks, Common Voice, and YouTube. While these methods are well explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers to choose cost-effective and quality-driven dataset extension strategies for low-resource languages. The key takeaway from the various data extension approaches is that paid crowdsourcing offers the best balance between cost and quality, outperforming volunteer crowdsourcing, open-source audiobooks, and unlabeled data usage. An ablation study shows that models trained on the expanded datasets outperform existing baselines and achieve ASR word error rates of 5.73% for Georgian and 9.9% for Armenian using a relatively small FastConformer architecture. We open-sourced both the Armenian and Georgian models to allow further research and practical applications.
中文: 本研究展示了针对亚美尼亚语和格鲁吉亚语等低资源语言的有效数据扩展方法,其中付费众包在提升自动语音识别性能方面被证明最具成本效益。
English: This study demonstrates effective data expansion methods for low-resource languages like Armenian and Georgian, with paid crowdsourcing proving most cost-efficient for improving automatic speech recognition performance.
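
The abstract reports results as word error rate (WER). As a reminder of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), here is a minimal sketch. It assumes whitespace tokenization and no text normalization, which may differ from the evaluation pipeline the authors actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Computed as the word-level Levenshtein distance, normalized by the
    number of words in the reference transcript.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("the cat sat down", "the dog sat"))
```

Under this definition a WER of 5.73% corresponds to roughly six word-level errors per hundred reference words.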

Authors:Mozhgan Hadadi, Mehdi Saraeian, Jackson Godbersen, Talukder Jubery, Yawei Li, Lakshmi Attigala, Aditya Balu, Soumik Sarkar, Patrick S. Schnable, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: Procedural Generation of 3D Maize Plant Architecture from LIDAR Data
Abstract:
This study introduces a robust framework for generating procedural 3D models of maize (Zea mays) plants from LiDAR point cloud data, offering a scalable alternative to traditional field-based phenotyping. Our framework leverages Non-Uniform Rational B-Spline (NURBS) surfaces to model the leaves of maize plants, combining Particle Swarm Optimization (PSO) for an initial approximation of the surface and a differentiable programming framework for precise refinement of the surface to fit the point cloud data. In the first optimization phase, PSO generates an approximate NURBS surface by optimizing its control points, aligning the surface with the LiDAR data, and providing a reliable starting point for refinement. The second phase uses NURBS-Diff, a differentiable programming framework, to enhance the accuracy of the initial fit by refining the surface geometry and capturing intricate leaf details. Our results demonstrate that, while PSO establishes a robust initial fit, the integration of differentiable NURBS significantly improves the overall quality and fidelity of the reconstructed surface. This hierarchical optimization strategy enables accurate 3D reconstruction of maize leaves across diverse genotypes, facilitating the subsequent extraction of complex traits like phyllotaxy. We demonstrate our approach on diverse genotypes of field-grown maize plants. All our codes are open-source to democratize these phenotyping approaches.
中文: 本研究提出一个可扩展框架,利用激光雷达数据和结合粒子群优化与可微分NURBS的两阶段优化方法,精确重建玉米叶片三维模型以改进表型分析,并开源全部代码。
English: This study presents a scalable framework using LiDAR data and a two-phase optimization approach with PSO and differentiable NURBS to accurately reconstruct 3D maize leaf models for enhanced phenotyping, with all code made open-source.
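
The first optimization phase relies on Particle Swarm Optimization. The sketch below shows a generic, textbook-style PSO loop on a toy objective. It is not the authors' code: the inertia and acceleration constants, the box bounds, and the quadratic objective (a stand-in for a point-cloud-to-surface fitting error over control points) are all assumptions chosen for illustration.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=30, n_iters=200, seed=0):
    """Minimize `objective` over a box using a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial positions, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]             # per-particle best positions
    best_val = [objective(p) for p in pos]     # per-particle best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, social weights (assumed)
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


if __name__ == "__main__":
    target = [1.0, -2.0, 0.5]  # hypothetical "true" control-point values
    cost = lambda x: sum((a - b) ** 2 for a, b in zip(x, target))
    print(pso_minimize(cost, dim=3, bounds=(-5.0, 5.0)))
```

In the two-phase scheme the abstract describes, a gradient-free search of this kind only needs to land near a good solution, since the differentiable refinement stage then takes over.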

Authors:Ambreesh Parthasarathy, Chandrasekar Subramanian, Ganesh Senrayan, Shreyash Adappanavar, Aparna Taneja, Balaraman Ravindran, Milind Tambe
Title: Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness
Abstract:
Restless Multi-Armed Bandits (RMABs) have been successfully applied to resource allocation problems in a variety of settings, including public health. With the rapid development of powerful large language models (LLMs), they are increasingly used to design reward functions to better match human preferences. Recent work has shown that LLMs can be used to tailor automated allocation decisions to community needs using language prompts. However, this has been studied primarily for English prompts and with a focus on task performance only. This can be an issue since grassroots workers, especially in developing countries like India, prefer to work in local languages, some of which are low-resource. Further, given the nature of the problem, biases along population groups unintended by the user are also undesirable. In this work, we study the effects on both task performance and fairness when the DLM algorithm, a recent work on using LLMs to design reward functions for RMABs, is prompted with non-English language commands. Specifically, we run the model on a synthetic environment for various prompts translated into multiple languages. The prompts themselves vary in complexity. Our results show that the LLM-proposed reward functions are significantly better when prompted in English compared to other languages. We also find that the exact phrasing of the prompt impacts task performance. Further, as prompt complexity increases, performance worsens for all languages; however, it is more robust with English prompts than with lower-resource languages. On the fairness side, we find that low-resource languages and more complex prompts are both highly likely to create unfairness along unintended dimensions.
中文: 最新研究表明,在不安定多臂老虎机中使用非英语提示(尤其是低资源语言)设计奖励函数时,相比英语提示会导致任务性能显著下降并加剧不公平性,且随着提示复杂度增加,所有语言性能均恶化,但英语提示的稳定性更高。
English: Recent research reveals that using non-English prompts, especially in low-resource languages, for LLM-designed reward functions in Restless Multi-Armed Bandits leads to significantly worse task performance and increased unfairness compared to English prompts, with performance deteriorating further as prompt complexity rises.

Authors:Wenhao Sun, Bing Li, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann
Title: Paradigm-Based Automatic HDL Code Generation Using LLMs
Abstract:
While large language models (LLMs) have demonstrated the ability to generate hardware description language (HDL) code for digital circuits, they still face the hallucination problem, which can result in the generation of incorrect HDL code or misinterpretation of specifications. In this work, we introduce a human-expert-inspired method to mitigate the hallucination of LLMs and enhance their performance in HDL code generation. We begin by constructing specialized paradigm blocks that consist of several steps designed to divide and conquer generation tasks, mirroring the design methodology of human experts. These steps include information extraction, human-like design flows, and the integration of external tools. LLMs are then instructed to classify the type of circuit in order to match it with the appropriate paradigm block and execute the block to generate the HDL code. Additionally, we propose a two-phase workflow for multi-round generation, aimed at effectively improving the testbench pass rate of the generated HDL code within a limited number of generation and verification rounds. Experimental results demonstrate that our method significantly enhances the functional correctness of the generated Verilog code.
Chinese: 本研究提出一种受人类专家启发的方法,通过专用范式模块和两阶段工作流程来减少大语言模型在硬件描述语言代码生成中的幻觉问题,显著提升了功能正确性。
English: This study introduces a human-expert-inspired method to mitigate hallucinations in large language models during hardware description language code generation by using specialized paradigm blocks and a two-phase workflow, significantly improving functional correctness.

Authors:Minfeng Qi, Qin Wang, Ningran Li, Shiping Chen, Tianqing Zhu
Title: BRC20 Snipping Attack
Abstract:
In this paper, we introduce and implement the BRC20 sniping attack. Our attack manipulates BRC20 token transfers in open markets and disrupts the fairness among bidding participants. The long-standing principle of "highest bidder wins" is rendered ineffective. Typically, open BRC20 token markets rely on Partially Signed Bitcoin Transactions (PSBT) to broadcast selling intents and wait for buying auctions. Our attack targets the BRC20 buying process (i.e., transfer) by injecting a front-running transaction to complete the full signature of the PSBT. At its core, the attack exploits the mempool's fee-based transaction selection mechanism to snipe the victim transaction, replicate metadata, and front-run the legitimate transaction. This attack applies to platforms using PSBT for BRC20 token transfers, including popular Bitcoin exchanges and marketplaces (e.g., Magic Eden, Unisat, Gate.io, OKX). We implemented and tested the attack on a Bitcoin testnet (regtest), validating its effectiveness through multiple experimental rounds. Results show that the attacker consistently replaces legitimate transactions by submitting higher-fee PSBTs. We have also made responsible disclosures to the mentioned exchanges.
中文: 本文提出的BRC20狙击攻击利用比特币交易费机制抢先合法代币转账,通过在PSBT平台上提交更高费用的交易来破坏“价高者得”的市场公平性。
English: This paper introduces a BRC20 sniping attack that exploits Bitcoin's transaction fee mechanism to front-run legitimate token transfers, undermining market fairness by invalidating the "highest bidder wins" principle on platforms using PSBTs.
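
The fee-based selection that the attack exploits can be illustrated with a toy model. The sketch below is a deliberate simplification and an assumption on my part: it is neither the paper's attack code nor Bitcoin Core's actual mempool or replacement policy. The `Tx` class, outpoint strings, and fee values are all hypothetical. It only shows that when two transactions spend the same seller input, a greedy fee-rate ordering confirms the higher-fee one and drops the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    """Toy transaction: an id, the outpoints it spends, its fee and size."""
    txid: str
    spends: frozenset  # outpoints consumed, e.g. {"seller:0"}
    fee_sats: int
    vsize: int

    @property
    def fee_rate(self) -> float:
        return self.fee_sats / self.vsize  # satoshis per virtual byte


def select_for_block(mempool):
    """Greedily pick transactions by descending fee rate, skipping any
    transaction that conflicts with (double-spends) one already chosen."""
    chosen, spent = [], set()
    for tx in sorted(mempool, key=lambda t: t.fee_rate, reverse=True):
        if spent.isdisjoint(tx.spends):
            chosen.append(tx)
            spent |= tx.spends
    return chosen


if __name__ == "__main__":
    # Both buyer transactions complete the same seller PSBT, so both
    # spend the seller's input; only one of them can confirm.
    victim = Tx("victim", frozenset({"seller:0", "victim:0"}), 2000, 200)
    attacker = Tx("attacker", frozenset({"seller:0", "attacker:0"}), 6000, 200)
    other = Tx("other", frozenset({"other:0"}), 1000, 250)
    print([t.txid for t in select_for_block([victim, attacker, other])])
```

In this simplified model the outcome depends only on fee rate, not on who bid first, which matches the abstract's observation that "highest bidder wins" no longer holds.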

Authors:Minquan Cheng, Huimei Wei, Kai Wan, Giuseppe Caire
Title: A New Construction Structure on Coded Caching with Linear Subpacketization: Non-Half-Sum Disjoint Packing
Abstract:
Coded caching is a promising technique to effectively reduce peak traffic by using local caches and the multicast gains generated by these local caches. We prefer to design a coded caching scheme with the subpacketization $F$ and transmission load $R$ as small as possible since these are the key metrics for evaluating the implementation complexity and transmission efficiency of the scheme, respectively. However, most of the existing coded caching schemes have large subpacketizations which grow exponentially with the number of users $K$, and there are a few schemes with linear subpacketizations which have large transmission loads. In this paper, we focus on studying the linear subpacketization, i.e., $K=F$, coded caching scheme with low transmission load. Specifically, we first introduce a new combinatorial structure called non-half-sum disjoint packing (NHSDP) which can be used to generate a coded caching scheme with $K=F$. Then a class of new schemes is obtained by constructing NHSDP. Theoretical and numerical comparisons show that (i) compared to the existing schemes with linear subpacketization (to the number of users), the proposed scheme achieves a lower load; (ii) compared to some existing schemes with polynomial subpacketization, the proposed scheme can also achieve a lower load in some cases; (iii) compared to some existing schemes with exponential subpacketization, the proposed scheme has loads close to those of these schemes in some cases. Moreover, the new concept of NHSDP is closely related to the classical combinatorial structures such as cyclic difference packing (CDP), non-three-term arithmetic progressions (NTAP), and perfect hash family (PHF). These connections indicate that NHSDP is an important combinatorial structure in the field of combinatorial design.
Chinese: 本文提出了一种称为非半和不相交包装(NHSDP)的新型组合结构,用于设计具有线性分组化的编码缓存方案,在保持实现效率的同时,相比现有方案实现了更低的传输负载。
English: This paper introduces a novel combinatorial structure called non-half-sum disjoint packing (NHSDP) to design coded caching schemes with linear subpacketization, achieving lower transmission loads compared to existing approaches while maintaining implementation efficiency.

Authors:Jinyu Wang, Minquan Cheng, Kai Wan, Giuseppe Caire
Title: PDA Construction via Union of Cartesian Product Cache Configurations for Coded Caching
Abstract:
Caching is an efficient technique to reduce peak traffic by storing popular content in local caches. Placement delivery array (PDA) proposed by Yan et al. is a combinatorial structure to design coded caching schemes with uncoded placement and one-shot linear delivery. By taking the $m$-fold Cartesian product of a small base PDA, Wang et al. constructed a big PDA while maintaining the memory ratio and transmission load unchanged, which achieves linear growth in both the number of users and coded caching gain. In order to achieve exponential growth in both the number of users and coded caching gain, in this paper we propose a PDA construction by taking the union operation of the cache configurations from the $m$-fold Cartesian product of a base PDA. The resulting PDA leads to a coded caching scheme with subpacketization increasing sub-exponentially with the number of users while keeping the load constant for fixed memory ratio. By applying the proposed construction to existing base PDAs, three new coded caching schemes are obtained, which cover some existing schemes as special cases and can achieve lower load with simultaneously lower subpacketization for some memory ratios.
Chinese: 本文提出了一种采用并集操作的PDA构建方法,能在保持传输负载不变的同时,实现用户数量和编码缓存增益的指数级增长,且子分组化程度仅呈次指数增长。
English: This paper proposes a novel placement delivery array (PDA) construction method using union operations to achieve exponential growth in both user capacity and coded caching gain while maintaining constant transmission load and sub-exponential subpacketization growth.

Authors:Tianxiu Xie, Keke Gai, Jing Yu, Liehuang Zhu, Bin Xiao
Title: SLVC-DIDA: Signature-less Verifiable Credential-based Issuer-hiding and Multi-party Authentication for Decentralized Identity
Abstract:
As an emerging paradigm in digital identity, Decentralized Identity (DID) offers advantages over traditional identity management methods in a variety of aspects, e.g., enhancing user-centric online services and ensuring complete user autonomy and control. Verifiable Credential (VC) techniques are used to facilitate decentralized DID-based access control across multiple entities. However, existing DID schemes generally rely on a distributed public key infrastructure that also causes challenges, such as context information deduction, key exposure, and issuer data leakage. To address the issues above, this paper proposes, for the first time, a Permanent Issuer-Hiding (PIH)-based DID multi-party authentication framework with a signature-less VC model, named SLVC-DIDA. Our proposed scheme avoids the dependence on signing keys by employing hashing and issuer membership proofs, which supports universal zero-knowledge multi-party DID authentications, eliminating additional technical integrations. We adopt a zero-knowledge RSA accumulator to maintain the anonymity of the issuer set, thereby enabling public verification while safeguarding the privacy of identity attributes via a Merkle tree-based VC list. By eliminating reliance on a Public Key Infrastructure (PKI), SLVC-DIDA enables fully decentralized issuance and verification of DIDs. Furthermore, our scheme ensures PIH through the implementation of the zero-knowledge issuer set and VC list, so that the risks of key leakage and contextual inference attacks are effectively mitigated. Our experiments further evaluate the effectiveness and practicality of SLVC-DIDA.
中文: 本文提出SLVC-DIDA去中心化身份框架,通过无签名可验证凭证模型和零知识证明技术消除对公钥基础设施的依赖,在支持多方认证的同时有效提升安全性和隐私保护能力。
English: This paper introduces SLVC-DIDA, a decentralized identity framework that eliminates reliance on public key infrastructure by using a signature-less verifiable credential model and zero-knowledge proofs to enhance security and privacy while supporting multi-party authentication.
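
The abstract mentions a Merkle tree-based VC list. To make that building block concrete, here is a minimal sketch of a SHA-256 Merkle tree with inclusion proofs. It is an illustration under stated assumptions, not the SLVC-DIDA construction: the function names are hypothetical, odd levels are padded by duplicating the last node (a simplification with known weaknesses in production settings), and the zero-knowledge layer and the RSA accumulator are not modeled at all.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """Return all tree levels, from hashed leaves up to the single root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels (simplification)
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(leaves) -> bytes:
    return merkle_levels(leaves)[-1][0]


def merkle_proof(leaves, index):
    """Return the sibling hashes needed to recompute the root from one leaf.

    Each entry is (sibling_hash, sibling_is_left).
    """
    proof = []
    for level in merkle_levels(leaves)[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1  # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof, root: bytes) -> bool:
    """Recompute the root from a leaf and its proof, then compare."""
    node = _h(leaf)
    for sibling, sibling_is_left in proof:
        node = _h(sibling + node) if sibling_is_left else _h(node + sibling)
    return node == root


if __name__ == "__main__":
    credentials = [b"vc-alice", b"vc-bob", b"vc-carol"]  # hypothetical VCs
    root = merkle_root(credentials)
    proof = merkle_proof(credentials, 1)
    print(verify(b"vc-bob", proof, root))
```

A verifier holding only the root can check that a credential belongs to the list using a proof whose length grows logarithmically with the list size, which is what makes such a structure attractive for public verification.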

Authors:Bin Han, Ye Yuan, Hans D. Schotten
Title: VENENA: A Deceptive Visual Encryption Framework for Wireless Semantic Secrecy
Abstract:
Eavesdropping has been a long-standing threat to the security and privacy of wireless communications, since it is difficult to detect and costly to prevent. As networks evolve towards Sixth Generation (6G) and semantic communication becomes increasingly central to next-generation wireless systems, securing semantic information transmission emerges as a critical challenge. While classical physical layer security (PLS) focuses on passive security, the recently proposed concept of physical layer deception (PLD) offers a semantic encryption measure to actively deceive eavesdroppers. Yet the existing studies of PLD have been dominantly information-theoretical and link-level oriented, lacking considerations of system-level design and practical implementation. In this work we propose a novel artificial intelligence (AI)-enabled framework called Visual ENcryption for Eavesdropping NegAtion (VENENA), which combines the techniques of PLD, visual encryption, and image poisoning, into a comprehensive mechanism for deceptive secure semantic transmission in future wireless networks. By leveraging advanced vision transformers and semantic codecs, VENENA demonstrates how semantic security can be enhanced through the synergy of physical layer techniques and artificial intelligence, paving the way for secure semantic communication in 6G networks.
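中文摘要:窃听对无线通信安全构成长期威胁,新提出的物理层欺骗技术虽能主动对抗窃听,但现有研究多集中于理论层面,缺乏系统级实施方案;为此,本文提出VENENA框架,融合物理层欺骗、视觉加密与图像投毒,实现面向未来无线网络的欺骗性安全语义传输。
English Summary: Eavesdropping poses a persistent threat to wireless communication security, and while physical layer deception offers active countermeasures, existing studies lack system-level design; this paper proposes VENENA, an AI-enabled framework that combines physical layer deception, visual encryption, and image poisoning for deceptive secure semantic transmission in future wireless networks.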

Authors:Guoyu Li, Shengyu Ye, Chunyun Chen, Yang Wang, Fan Yang, Ting Cao, Cheng Liu, Mohamed M. Sabry, Mao Yang
Title: LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator
Abstract:
The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce the bit width below 1 bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which helps to transform various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of $1.4\times$ to $7.0\times$ and $1.5\times$ to $146.1\times$, respectively, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by $0.1\%$ to $3.1\%$ using the $L_2$ distance similarity, $0.1\%$ to $3.4\%$ with the $L_1$ distance similarity, and $0.1\%$ to $3.8\%$ when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from $1.4\%$ to $3.0\%$.
Chinese: LUT-DLA框架采用向量量化将神经网络转换为查找表,实现极低位量化,大幅降低计算和硬件成本,同时保持高效运行且精度损失极小。
English: The LUT-DLA framework utilizes vector quantization to convert neural networks into lookup tables, enabling extreme low-bit quantization that significantly reduces computational and hardware costs while maintaining high efficiency and minimal accuracy loss.
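To make the lookup-table idea in the abstract above concrete, here is a minimal pure-Python sketch of a vector-quantized dot product. It is not LUT-DLA or LUTBoost: the function names (`nearest`, `build_lut`, `lut_dot`), the hand-picked codebooks, and the choice of squared L2 distance for centroid matching are all assumptions made for illustration.

```python
def nearest(codebook, vec):
    """Index of the centroid closest to vec (squared L2 distance, an assumed choice)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def build_lut(codebook, weight_chunk):
    """Precompute dot(centroid, weight_chunk) for every centroid in the codebook."""
    return [sum(c * w for c, w in zip(centroid, weight_chunk)) for centroid in codebook]


def lut_dot(x, weights, codebooks, sub_dim):
    """Approximate dot(x, weights) using one table lookup per sub-vector of x."""
    total = 0.0
    for i, codebook in enumerate(codebooks):
        lo, hi = i * sub_dim, (i + 1) * sub_dim
        # In a real deployment the table would be built once, offline.
        lut = build_lut(codebook, weights[lo:hi])
        total += lut[nearest(codebook, x[lo:hi])]
    return total


# Hypothetical toy setup: two sub-vectors of length 2, two centroids each.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [2.0, 2.0]],
]
x = [0.0, 1.0, 2.0, 2.0]        # each sub-vector happens to equal a centroid
weights = [3.0, 4.0, 0.5, 0.25]
print(lut_dot(x, weights, codebooks, sub_dim=2))  # 5.5, the exact dot product here
```

With two centroids per two-element sub-vector, each index costs 1 bit for 2 elements, i.e. 0.5 bit per element, which is how vector quantization can go below the 1-bit floor of scalar quantization. When a sub-vector does not coincide with a centroid, the result is only an approximation.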

Authors:Wei Lu, Si-Bao Chen, Chris H. Q. Ding, Jin Tang, Bin Luo
Title: LWGANet: A Lightweight Group Attention Backbone for Remote Sensing Visual Tasks
Abstract:
Remote sensing (RS) visual tasks have gained significant academic and practical importance. However, they encounter numerous challenges that hinder effective feature extraction, including the detection and recognition of multiple objects exhibiting substantial variations in scale within a single image. While prior dual-branch or multi-branch architectural strategies have been effective in managing these object variances, they have concurrently resulted in considerable increases in computational demands and parameter counts. Consequently, these architectures are rendered less viable for deployment on resource-constrained devices. Contemporary lightweight backbone networks, designed primarily for natural images, frequently encounter difficulties in effectively extracting features from multi-scale objects, which compromises their efficacy in RS visual tasks. This article introduces LWGANet, a specialized lightweight backbone network tailored for RS visual tasks, incorporating a novel lightweight group attention (LWGA) module designed to address these specific challenges. The LWGA module, tailored for RS imagery, adeptly harnesses redundant features to extract a wide range of spatial information, from local to global scales, without introducing additional complexity or computational overhead. This facilitates precise feature extraction across multiple scales within an efficient framework. LWGANet was rigorously evaluated across twelve datasets, which span four crucial RS visual tasks: scene classification, oriented object detection, semantic segmentation, and change detection. The results confirm LWGANet's widespread applicability and its ability to maintain an optimal balance between high performance and low complexity, achieving SOTA results across diverse datasets. LWGANet emerges as a novel solution for resource-limited scenarios requiring robust RS image processing capabilities.
中文: LWGANet是一种专为遥感视觉任务设计的新型轻量级骨干网络,其创新的组注意力模块能在不增加计算复杂度的前提下有效提取多尺度特征,在多种数据集上实现最优性能的同时保持低资源需求。
English: LWGANet is a novel lightweight backbone network designed for remote sensing visual tasks, featuring a group attention module that efficiently extracts multi-scale features without increasing computational complexity, achieving state-of-the-art performance across diverse datasets while maintaining low resource demands.

Authors:Chentianye Xu, Jionghao Lin, Tongshuang Wu, Vincent Aleven, Kenneth R. Koedinger
Title: Improving Automated Feedback Systems for Tutor Training in Low-Resource Scenarios through Data Augmentation
Abstract:
Tutoring is an effective instructional method for enhancing student learning, yet its success relies on the skill and experience of the tutors. This reliance presents challenges for the widespread implementation of tutoring, particularly in training novice tutors. To support tutor training programs, real-time automated feedback systems are essential for efficiently training large numbers of tutors. Lin et al.'s previous study employed Generative Pre-Trained Transformers (GPT) for sequence labeling to identify desirable and undesirable praise components in a tutor training dataset, providing explanatory feedback. However, this approach requires a significant amount of labeled data for fine-tuning, which is both labor-intensive and dependent on expert input. To address the challenges associated with extensive data labeling, the current study explores the use of prompting more advanced GPT models like GPT-4o to generate synthetic datasets for augmenting labeled response data, followed by fine-tuning a GPT-3.5 model. Our results demonstrate that our data augmentation approach generalizes effectively to identify other types of praise, compared to the same model fine-tuned without augmentation. These findings suggest that for data-intensive tasks, synthetic data generated through GPT model prompting can substantially enhance fine-tuned model performance in low-resource scenarios.
Chinese: 利用GPT模型生成合成数据,可在低资源环境下通过增强微调模型性能来提升导师培训效果,从而克服对大量标注数据的依赖。
English: Real-time automated feedback systems using GPT models can enhance tutor training by generating synthetic data to improve fine-tuned model performance in low-resource settings, overcoming the need for extensive labeled datasets.

Authors:Yang Chen, Jingcai Guo, Song Guo, Jingren Zhou, Dacheng Tao
Title: Towards Robust and Realistic Human Pose Estimation via WiFi Signals
Abstract:
Robust WiFi-based human pose estimation is a challenging task that bridges discrete and subtle WiFi signals to human skeletons. This paper revisits this problem and reveals two critical yet overlooked issues: 1) the cross-domain gap, i.e., significant variations between source- and target-domain pose distributions; and 2) the structural fidelity gap, i.e., predicted skeletal poses manifest distorted topology, usually with misplaced joints and disproportionate bone lengths. This paper fills these gaps by reformulating the task into a novel two-phase framework dubbed DT-Pose: Domain-consistent representation learning and Topology-constrained Pose decoding. Concretely, we first propose a temporal-consistent contrastive learning strategy with uniformity regularization, coupled with self-supervised masking-reconstruction operations, to enable robust learning of domain-consistent and motion-discriminative WiFi-specific representations. Beyond this, we introduce a simple yet effective pose decoder with task prompts, which integrates Graph Convolution Network (GCN) and Transformer layers to constrain the topology structure of the generated skeleton by exploring the adjacent-overarching relationships among human joints. Extensive experiments conducted on various benchmark datasets highlight the superior performance of our method in tackling these fundamental challenges in both 2D and 3D human pose estimation tasks.
中文: 本文提出DT-Pose双阶段框架,通过域一致性表征学习和拓扑约束解码,解决WiFi姿态估计中的跨域差异和结构保真度问题。
English: This paper introduces DT-Pose, a two-phase framework addressing cross-domain and structural fidelity gaps in WiFi-based pose estimation through domain-consistent representation learning and topology-constrained decoding.

Authors:Conrad Borchers, Danielle R. Thomas, Jionghao Lin, Ralph Abboud, Kenneth R. Koedinger
Title: Augmenting Human-Annotated Training Data with Large Language Model Generation and Distillation in Open-Response Assessment
Abstract:
Large Language Models (LLMs) like GPT-4o can help automate text classification tasks at low cost and scale. However, there are major concerns about the validity and reliability of LLM outputs. By contrast, human coding is generally more reliable but expensive to procure at scale. In this study, we propose a hybrid solution to leverage the strengths of both. We combine human-coded data and synthetic LLM-produced data to fine-tune a classical machine learning classifier, distilling both into a smaller BERT model. We evaluate our method on a human-coded test set as a validity measure for LLM output quality. In three experiments, we systematically vary LLM-generated samples' size, variety, and consistency, informed by best practices in LLM tuning. Our findings indicate that augmenting datasets with synthetic samples improves classifier performance, with optimal results achieved at an 80% synthetic to 20% human-coded data ratio. Lower temperature settings of 0.3, corresponding to less variability in LLM generations, produced more stable improvements but also limited model learning from augmented samples. In contrast, higher temperature settings (0.7 and above) introduced greater variability in performance estimates and, at times, lower performance. Hence, LLMs may produce more uniform output that classifiers overfit to earlier or produce more diverse output that runs the risk of deteriorating model performance through information irrelevant to the prediction task. Filtering out inconsistent synthetic samples did not enhance performance. We conclude that integrating human and LLM-generated data to improve text classification models in assessment offers a scalable solution that leverages both the accuracy of human coding and the variety of LLM outputs.
中文: 本研究提出一种混合方法,结合人工标注数据和LLM生成的合成数据来微调BERT分类器,发现在80%合成数据与20%人工数据的比例下,较低的温度设置能带来更稳定的性能提升。
English: This study proposes a hybrid approach that combines human-coded and LLM-generated synthetic data to fine-tune a BERT classifier, finding optimal performance at an 80% synthetic to 20% human data ratio with lower temperature settings yielding more stable improvements.
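The abstract above links lower temperature to less variable LLM generations. A minimal sketch of why, assuming the standard temperature-scaled softmax over next-token logits (the abstract does not specify the sampler, and the function name and toy logits below are hypothetical):

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for temperature in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At temperature 0.3 most of the probability mass sits on the top token, so repeated samples look alike; at 0.7 and above the distribution flattens and sampled outputs become more diverse.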

Authors:Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, Huchuan Lu
Title: The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Abstract:
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level and temporal-level tokens that utilize MLLM's autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method. Code and model weights will be released at VRS-HQ.
中文: VRS-HQ提出了一种端到端视频推理分割方法,利用多模态大语言模型结合时序动态聚合和令牌驱动关键帧选择来增强时空特征表示,在ReVOS基准测试中取得了最先进的性能。
English: VRS-HQ introduces an end-to-end video reasoning segmentation method using Multimodal Large Language Models with Temporal Dynamic Aggregation and Token-driven Keyframe Selection to enhance spatiotemporal feature representation, achieving state-of-the-art performance on ReVOS benchmarks.

Authors:Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, Lu Jiang
Title: Diffusion Adversarial Post-Training for One-Step Video Generation
Abstract:
Diffusion models are widely used for image and video generation, but their iterative generation process is slow and expensive. While existing distillation approaches have demonstrated the potential for one-step generation in the image domain, they still suffer from significant quality degradation. In this work, we propose Adversarial Post-Training (APT) against real data following diffusion pre-training for one-step video generation. To improve the training stability and quality, we introduce several improvements to the model architecture and training procedures, along with an approximated R1 regularization objective. Empirically, our experiments show that our adversarial post-trained model, Seaweed-APT, can generate 2-second, 1280x720, 24fps videos in real time using a single forward evaluation step. Additionally, our model is capable of generating 1024px images in a single step, achieving quality comparable to state-of-the-art methods.
Chinese: 提出的对抗性后训练(APT)方法实现了高效的单步视频和图像生成,在质量和稳定性上均有提升,能够实时生成媲美顶尖技术的高分辨率内容。
English: The proposed Adversarial Post-Training (APT) method enables efficient one-step video and image generation with improved quality and stability, achieving real-time high-resolution output comparable to state-of-the-art techniques.

Authors:Yuxue Yang, Lue Fan, Zuzeng Lin, Feng Wang, Zhaoxiang Zhang
Title: LayerAnimate: Layer-level Control for Animation
Abstract:
Traditional animation production decomposes visual elements into discrete layers to enable independent processing for sketching, refining, coloring, and in-betweening. Existing anime video generation methods typically treat animation as a distinct data domain different from real-world videos, lacking fine-grained control at the layer level. To bridge this gap, we introduce LayerAnimate, a novel video diffusion framework with a layer-aware architecture that empowers the manipulation of layers through layer-level controls. The development of a layer-aware framework faces a significant data scarcity challenge due to the commercial sensitivity of professional animation assets. To address this limitation, we propose a data curation pipeline featuring Automated Element Segmentation and Motion-based Hierarchical Merging. Through quantitative and qualitative comparisons and a user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an effective tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-level animation applications and creative flexibility. Our code is available at https://layeranimate.github.io.
中文: LayerAnimate是一种创新的视频扩散框架,通过引入层级感知架构和控制,解决了传统动画方法缺乏细粒度操控的局限,为专业动画师和业余爱好者提供了更精准的动画制作工具和创作灵活性。
English: LayerAnimate is a novel video diffusion framework that introduces layer-aware architecture and controls to overcome the limitations of traditional animation methods, enabling fine-grained manipulation and enhanced creative flexibility for both professionals and amateurs.

Authors:Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han
Title: Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models
Abstract:
Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
Chinese: 近期长上下文语言模型(LCLMs)实现了直接上下文检索与推理(ICR²),提出的ICR²基准通过引入真实干扰段落进行评估,三种优化方法显著提升了模型性能,在多数任务中甚至超越了GPT-4-Turbo。
English: Recent long-context language models (LCLMs) enable direct in-context retrieval and reasoning (ICR²), and the proposed ICR² benchmark with realistic confounding passages evaluates them, while three enhancement methods significantly boost performance, even surpassing GPT-4-Turbo in most tasks.

Authors:Mobai Xue, Jun Du, Zhenrong Zhang, Jiefeng Ma, Qikai Chang, Pengfei Hu, Jianshu Zhang, Yu Hu
Title: Skeleton and Font Generation Network for Zero-shot Chinese Character Generation
Abstract:
Automatic font generation remains a challenging research issue, primarily due to the vast number of Chinese characters, each with unique and intricate structures. Our investigation of previous studies reveals an inherent bias capable of causing structural changes in characters. Specifically, when generating a Chinese character similar to, but different from, those in the training samples, the bias is prone to either correcting or ignoring these subtle variations. To address this concern, we propose a novel Skeleton and Font Generation Network (SFGN) to achieve more robust Chinese character font generation. Our approach includes a skeleton builder and a font generator. The skeleton builder synthesizes content features using low-resource text input, enabling our technique to realize font generation independently of content image inputs. Unlike previous font generation methods that treat font style as a global embedding, we introduce a font generator to align content and style features on the radical level, which is a brand-new perspective for font generation. In addition to common characters, we also conduct experiments on misspelled characters, a substantial portion of which differ only slightly from the common ones. Our approach visually demonstrates the efficacy of generated images and outperforms current state-of-the-art font generation methods. Moreover, we believe that misspelled character generation has significant pedagogical implications, and we verify this supposition through experiments. We used generated misspelled characters as data augmentation in Chinese character error correction tasks, simulating the scenario where students learn handwritten Chinese characters with the help of misspelled characters. The significantly improved performance of error correction tasks demonstrates the effectiveness of our proposed approach and the value of misspelled character generation.
中文: 本文提出了一种新颖的骨架与字体生成网络(SFGN),通过低资源输入合成内容特征并在部首级别对齐样式,改进了中文字体生成,在错别字生成等教学应用中表现出色,显著提升了汉字纠错任务的性能。
English: This paper introduces a novel Skeleton and Font Generation Network (SFGN) that enhances Chinese character font generation by synthesizing content features from low-resource inputs and aligning style at the radical level, outperforming existing methods and proving effective for pedagogical applications like error correction through misspelled character generation.

Authors:Sitong Gong, Yunzhi Zhuge, Lu Zhang, Yifan Wang, Pingping Zhang, Lijun Wang, Huchuan Lu
Title: AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
Abstract:
The essence of audio-visual segmentation (AVS) lies in locating and delineating sound-emitting objects within a video stream. While Transformer-based methods have shown promise, their handling of long-range dependencies struggles due to quadratic computational costs, presenting a bottleneck in complex scenarios. To overcome this limitation and facilitate complex multi-modal comprehension with linear complexity, we introduce AVS-Mamba, a selective state space model to address the AVS task. Our framework incorporates two key components for video understanding and cross-modal learning: Temporal Mamba Block for sequential video processing and Vision-to-Audio Fusion Block for advanced audio-vision integration. Building on this, we develop the Multi-scale Temporal Encoder, aimed at enhancing the learning of visual features across scales, facilitating the perception of intra- and inter-frame information. To perform multi-modal fusion, we propose the Modality Aggregation Decoder, leveraging the Vision-to-Audio Fusion Block to integrate visual features into audio features across both frame and temporal levels. Further, we adopt the Contextual Integration Pyramid to perform audio-to-vision spatial-temporal context collaboration. Through these innovative contributions, our approach achieves new state-of-the-art results on the AVSBench-object and AVSBench-semantic datasets. Our source code and model weights are available at AVS-Mamba.
Chinese: 本研究提出AVS-Mamba,一种选择性状态空间模型,通过线性复杂度实现高效的音视频分割,克服了Transformer方法在处理长距离依赖时的局限性,并在基准数据集上取得了最先进的成果。
English: The study introduces AVS-Mamba, a selective state space model that overcomes the limitations of Transformer-based methods by enabling efficient audio-visual segmentation with linear complexity, achieving state-of-the-art results on benchmark datasets.

Authors:Daniela Pinto, Ivone Amorim, Eva Maia, Isabel Praça
Title: A Novel Approach to Network Traffic Analysis: the HERA tool
Abstract:
Cybersecurity threats highlight the need for robust network intrusion detection systems to identify malicious behaviour. These systems rely heavily on large datasets to train machine learning models capable of detecting patterns and predicting threats. In the past two decades, researchers have produced a multitude of datasets; however, some widely utilised recent datasets generated with CICFlowMeter contain inaccuracies. These result in flow generation and feature extraction inconsistencies, leading to skewed results and reduced system effectiveness. Other tools in this context lack ease of use, customizable feature sets, and flow labelling options. In this work, we introduce HERA, a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features. Validated and tested with the UNSW-NB15 dataset, HERA demonstrated accurate flow and label generation.
Chinese: 摘要介绍了HERA这一新型开源工具,旨在解决现有网络入侵检测数据集中的不准确性和局限性,通过支持自定义流文件和特征生成,并在UNSW-NB15数据集上成功验证其准确性。
English: The abstract introduces HERA, a new open-source tool designed to address inaccuracies and limitations in existing network intrusion detection datasets by enabling customizable flow file and feature generation, validated successfully with the UNSW-NB15 dataset.

Authors:ZeKe Xiao, Qin Wang, Hammond Pearce, Shiping Chen
Title: Logic Meets Magic: LLMs Cracking Smart Contract Vulnerabilities
Abstract:
Smart contract vulnerabilities have caused significant economic losses in blockchain applications. Large Language Models (LLMs) provide new possibilities for the time-consuming task of detecting them. However, state-of-the-art LLM-based detection solutions are often plagued by high false-positive rates. In this paper, we push the boundaries of existing research in two key ways. First, our evaluation is based on Solidity v0.8, offering the most up-to-date insights compared to prior studies that focus on older versions (v0.4). Second, we leverage five of the latest LLMs (from different companies), ensuring comprehensive coverage of the most advanced capabilities in the field. We conducted a series of rigorous evaluations. Our experiments demonstrate that a well-designed prompt can reduce the false-positive rate by over 60%. Surprisingly, we also discovered that the recall rate for detecting some specific vulnerabilities in Solidity v0.8 has dropped to just 13% compared to earlier versions (i.e., v0.4). Further analysis reveals the root cause of this decline: the reliance of LLMs on identifying changes in newly introduced libraries and frameworks during detection.
中文摘要:本研究通过采用最新Solidity v0.8版本和前沿大语言模型推进智能合约漏洞检测,优化提示词使误报率降低60%,但发现因模型依赖新库变更导致漏洞召回率骤降至13%。
English Summary: This study advances smart contract vulnerability detection by evaluating the latest Solidity v0.8 with cutting-edge LLMs, achieving a 60% false-positive reduction through optimized prompts but revealing a sharp recall drop to 13% due to LLMs' dependency on new library changes.

Authors:Subrata Kumer Paul, Abu Saleh Musa Miah, Rakhi Rani Paul, Md. Ekramul Hamid, Jungpil Shin, Md Abdur Rahim
Title: IoT-Based Real-Time Medical-Related Human Activity Recognition Using Skeletons and Multi-Stage Deep Learning for Healthcare
Abstract:
The Internet of Things (IoT) and mobile technology have significantly transformed healthcare by enabling real-time monitoring and diagnosis of patients. Recognizing medical-related human activities (MRHA) is pivotal for healthcare systems, particularly for identifying actions that are critical to patient well-being. However, challenges such as high computational demands, low accuracy, and limited adaptability persist in Human Motion Recognition (HMR). While some studies have integrated HMR with IoT for real-time healthcare applications, limited research has focused on recognizing MRHA as essential for effective patient monitoring. This study proposes a novel HMR method for MRHA detection, leveraging multi-stage deep learning techniques integrated with IoT. The approach employs EfficientNet to extract optimized spatial features from skeleton frame sequences using seven Mobile Inverted Bottleneck Convolution (MBConv) blocks, followed by ConvLSTM to capture spatio-temporal patterns. A classification module with global average pooling, a fully connected layer, and a dropout layer generates the final predictions. The model is evaluated on the NTU RGB+D 120 and HMDB51 datasets, focusing on MRHA such as sneezing, falling, walking, and sitting. It achieves 94.85% accuracy for cross-subject evaluations and 96.45% for cross-view evaluations on NTU RGB+D 120, along with 89.00% accuracy on HMDB51. Additionally, the system integrates IoT capabilities using a Raspberry Pi and GSM module, delivering real-time alerts via Twilio's SMS service to caregivers and patients. This scalable and efficient solution bridges the gap between HMR and IoT, advancing patient monitoring, improving healthcare outcomes, and reducing costs.
中文: 本研究提出了一种结合物联网的多阶段深度学习新方法,用于精确识别医疗相关人体活动,在基准数据集上取得高准确率,并通过实时警报提升患者监护水平。
English: This study introduces a novel human motion recognition method using multi-stage deep learning integrated with IoT to accurately detect medical-related activities, achieving high accuracy on benchmark datasets and enabling real-time alerts for improved patient monitoring.

Authors:Erlong Liu, Yu-Chang Wu, Xiaobin Huang, Chengrui Gao, Ren-Jian Wang, Ke Xue, Chao Qian
Title: Pareto Set Learning for Multi-Objective Reinforcement Learning
Abstract:
Multi-objective decision-making problems have emerged in numerous real-world scenarios, such as video games, navigation and robotics. Considering the clear advantages of Reinforcement Learning (RL) in optimizing decision-making processes, researchers have delved into the development of Multi-Objective RL (MORL) methods for solving multi-objective decision problems. However, previous methods either cannot obtain the entire Pareto front, or employ only a single policy network for all the preferences over multiple objectives, which may not produce personalized solutions for each preference. To address these limitations, we propose a novel decomposition-based framework for MORL, Pareto Set Learning for MORL (PSL-MORL), that harnesses the generation capability of a hypernetwork to produce the parameters of the policy network for each decomposition weight, generating relatively distinct policies for various scalarized subproblems with high efficiency. PSL-MORL is a general framework that is compatible with any RL algorithm. The theoretical result guarantees the superiority of the model capacity of PSL-MORL and the optimality of the obtained policy network. Through extensive experiments on diverse benchmarks, we demonstrate the effectiveness of PSL-MORL in achieving dense coverage of the Pareto front, significantly outperforming state-of-the-art MORL methods in the hypervolume and sparsity indicators.
Chinese: 本文提出PSL-MORL这一基于分解的新框架,利用超网络为多目标强化学习生成差异化策略,实现了对帕累托前沿的密集覆盖,并在超体积和稀疏性指标上显著优于现有方法。
English: This paper introduces PSL-MORL, a novel decomposition-based framework that uses a hypernetwork to generate distinct policies for multi-objective reinforcement learning, achieving dense coverage of the Pareto front and significantly outperforming existing methods in the hypervolume and sparsity indicators.
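As a rough illustration of the mechanism described in the abstract above, the sketch below maps a preference (decomposition weight) vector to policy parameters and scalarizes a vector of per-objective returns. It is not the PSL-MORL implementation: the linear form of the hypernetwork, the weighted-sum scalarization, and every name and number here are assumptions chosen to keep the example small.

```python
def scalarize(returns, preference):
    """Weighted-sum scalarization of per-objective returns (an assumed decomposition)."""
    return sum(r * w for r, w in zip(returns, preference))


def linear_hypernet(preference, weight_matrix, bias):
    """Toy hypernetwork: policy parameters = weight_matrix @ preference + bias."""
    return [
        sum(w * p for w, p in zip(row, preference)) + b
        for row, b in zip(weight_matrix, bias)
    ]


# Hypothetical hypernetwork producing three policy parameters from two preference weights.
weight_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.25]

for preference in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    policy_params = linear_hypernet(preference, weight_matrix, bias)
    print(preference, policy_params)

# A policy earning returns (10, 2) on the two objectives, scored under one preference.
print(scalarize([10.0, 2.0], [0.25, 0.75]))  # 4.0
```

Each preference vector yields its own parameter vector, so a single hypernetwork can stand in for a family of policies, one per scalarized subproblem. In practice the hypernetwork would be a trained neural network rather than a fixed linear map.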

Authors:Wen Wu, Ziyun Cui, Chang Lei, Yinan Duan, Diyang Qu, Ji Wu, Bowen Zhou, Runsen Chen, Chao Zhang
Title: The 1st SpeechWellness Challenge: Detecting Suicide Risk Among Adolescents
Abstract:
The 1st SpeechWellness Challenge (SW1) aims to advance methods for detecting current suicide risk in adolescents using speech analysis techniques. Suicide among adolescents is a critical public health issue globally. Early detection of suicidal tendencies can lead to timely intervention and potentially save lives. Traditional methods of assessment often rely on self-reporting or clinical interviews, which may not always be accessible. The SW1 challenge addresses this gap by exploring speech as a non-invasive and readily available indicator of mental health. We release the SW1 dataset which contains speech recordings from 600 adolescents aged 10-18 years. By focusing on speech generated from natural tasks, the challenge seeks to uncover patterns and markers that correlate with current suicide risk.
Chinese: 首届语音健康挑战赛(SW1)发布了包含600名青少年语音记录的数据集,旨在通过分析自然任务中的语音模式,开发无创检测当前自杀风险的方法,以弥补传统临床评估的不足。
English: The 1st SpeechWellness Challenge (SW1) introduces a dataset of 600 adolescent speech recordings to develop non-invasive speech analysis methods for detecting current suicide risk, addressing limitations of traditional clinical assessments.

Authors:Zhenyu Pan, Xuefeng Song, Yunkun Wang, Rongyu Cao, Binhua Li, Yongbin Li, Han Liu
Title: Do Code LLMs Understand Design Patterns?
Abstract:
Code Large Language Models (LLMs) demonstrate great versatility in adapting to various downstream tasks, including code generation and completion, as well as bug detection and fixing. However, Code LLMs often fail to capture existing coding standards, leading to the generation of code that conflicts with the required design patterns for a given project. As a result, developers must post-process to adapt the generated code to the project's design norms. In this work, we empirically investigate the biases of Code LLMs in software development. Through carefully designed experiments, we assess the models' understanding of design patterns across recognition, comprehension, and generation. Our findings reveal that biases in Code LLMs significantly affect the reliability of downstream tasks.
中文: 代码大语言模型在多种任务中表现出强大的适应性,但常无法遵循项目特定的设计模式,需要人工后处理,其固有偏见严重影响了下游任务的可靠性。
English: Code LLMs show strong adaptability in various tasks but often fail to adhere to project-specific design patterns, requiring manual post-processing and undermining task reliability due to inherent biases.

Authors:Lanlan Feng, Ce Zhu, Yipeng Liu, Saiprasad Ravishankar, Longxiu Huang
Title: Learnable Scaled Gradient Descent for Guaranteed Robust Tensor PCA
Abstract:
Robust tensor principal component analysis (RTPCA) aims to separate the low-rank and sparse components from multi-dimensional data, making it an essential technique in the signal processing and computer vision fields. Recently emerging tensor singular value decomposition (t-SVD) has gained considerable attention for its ability to better capture the low-rank structure of tensors compared to traditional matrix SVD. However, existing methods often rely on the computationally expensive tensor nuclear norm (TNN), which limits their scalability for real-world tensors. To address this issue, we explore an efficient scaled gradient descent (SGD) approach within the t-SVD framework for the first time, and propose the RTPCA-SGD method. Theoretically, we rigorously establish the recovery guarantees of RTPCA-SGD under mild assumptions, demonstrating that with appropriate parameter selection, it achieves linear convergence to the true low-rank tensor at a constant rate, independent of the condition number. To enhance its practical applicability, we further propose a learnable self-supervised deep unfolding model, which enables effective parameter learning. Numerical experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed methods while maintaining competitive computational efficiency, especially consuming less time than RTPCA-TNN.
中文: 本文提出RTPCA-SGD方法,在t-SVD框架下首次采用缩放梯度下降技术实现鲁棒张量主成分分析,不仅理论保证线性收敛,且在计算效率上显著优于基于张量核范数的传统方法。
English: This paper introduces RTPCA-SGD, an efficient scaled gradient descent method within the t-SVD framework for robust tensor principal component analysis, which achieves linear convergence and superior computational efficiency compared to existing tensor nuclear norm approaches.

Authors:Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Weize Kong, Tao Chen, Zhuowan Li, Michael Bendersky, Hamed Zamani
Title: Reasoning-Enhanced Self-Training for Long-Form Personalized Text Generation
Abstract:
Personalized text generation requires a unique ability of large language models (LLMs) to learn from context that they often do not encounter during their standard training. One way to encourage LLMs to better use personalized context for generating outputs that better align with the user's expectations is to instruct them to reason over the user's past preferences, background knowledge, or writing style. To achieve this, we propose Reasoning-Enhanced Self-Training for Personalized Text Generation (REST-PG), a framework that trains LLMs to reason over personal data during response generation. REST-PG first generates reasoning paths to train the LLM's reasoning abilities and then employs Expectation-Maximization Reinforced Self-Training to iteratively train the LLM based on its own high-reward outputs. We evaluate REST-PG on the LongLaMP benchmark, consisting of four diverse personalized long-form text generation tasks. Our experiments demonstrate that REST-PG achieves significant improvements over state-of-the-art baselines, with an average relative performance gain of 14.5% on the benchmark.
Chinese: REST-PG框架通过生成推理路径和迭代自训练,使大语言模型能够基于用户个性化上下文进行推理,在LongLaMP基准测试中实现了14.5%的平均性能提升。
English: The REST-PG framework enhances personalized text generation by training large language models to reason over user-specific context through generated reasoning paths and iterative self-training, achieving a 14.5% average performance improvement on the LongLaMP benchmark.

Authors:Naoki Wake, Atsushi Kanehira, Jun Takamatsu, Kazuhiro Sasabuchi, Katsushi Ikeuchi
Title: VLM-driven Behavior Tree for Context-aware Task Planning
Abstract:
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
中文: 本文提出了一种新颖框架,利用视觉语言模型交互式生成和编辑带有视觉条件节点的行为树,使机器人能在视觉复杂环境中进行情境感知操作,并在真实咖啡馆场景中验证了其可行性。
English: This paper introduces a novel framework that uses Vision-Language Models to interactively generate and edit Behavior Trees with visual condition nodes, enabling context-aware robot operations in visually complex environments, as validated in a real-world cafe scenario.

Authors:Chris Samarinas, Alexander Krubner, Alireza Salemi, Youngwoo Kim, Hamed Zamani
Title: Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation
Abstract:
This paper presents ICAT, an evaluation framework for measuring coverage of diverse factual information in long-form text generation. ICAT breaks down a long output text into a list of atomic claims and not only verifies each claim through retrieval from a (reliable) knowledge source, but also computes the alignment between the atomic factual claims and various aspects expected to be presented in the output. We study three implementations of the ICAT framework, each with a different assumption on the availability of aspects and alignment method. By adopting data from the diversification task in the TREC Web Track and the ClueWeb corpus, we evaluate the ICAT framework. We demonstrate strong correlation with human judgments and provide comprehensive evaluation across multiple state-of-the-art LLMs. Our framework further offers interpretable and fine-grained analysis of diversity and coverage. Its modular design allows for easy adaptation to different domains and datasets, making it a valuable tool for evaluating the qualitative aspects of long-form responses produced by LLMs.
Chinese: ICAT框架通过将长文本分解为原子声明、检索验证其准确性并比对预期方面,评估生成文本中事实信息的多样性和覆盖范围,与人工评估高度一致,并提供可解释的细粒度分析。
English: ICAT is a framework that evaluates the diversity and coverage of factual information in long-form generated texts by decomposing them into atomic claims, verifying their accuracy via retrieval, and aligning them with expected aspects, showing strong correlation with human judgment and offering interpretable analysis.

Authors:Yongjeong Oh, Joohyuk Park, Jinho Choi, Jihong Park, Yo-Seb Jeon
Title: Blind Training for Channel-Adaptive Digital Semantic Communications
Abstract:
Semantic encoders and decoders for digital semantic communication (SC) often struggle to adapt to variations in unpredictable channel environments and diverse system designs. To address these challenges, this paper proposes a novel framework for training semantic encoders and decoders to enable channel-adaptive digital SC. The core idea is to use binary symmetric channel (BSC) as a universal representation of generic digital communications, eliminating the need to specify channel environments or system designs. Based on this idea, our framework employs parallel BSCs to equivalently model the relationship between the encoder's output and the decoder's input. The bit-flip probabilities of these BSCs are treated as trainable parameters during end-to-end training, with varying levels of regularization applied to address diverse requirements in practical systems. The advantage of our framework is justified by developing a training-aware communication strategy for the inference stage. This strategy makes communication bit errors align with the pre-trained bit-flip probabilities by adaptively selecting power and modulation levels based on practical requirements and channel conditions. Simulation results demonstrate that the proposed framework outperforms existing training approaches in terms of both task performance and power consumption.
中文摘要:本文提出了一种新颖的数字语义通信训练框架,通过将二进制对称信道作为通用通信模型并优化可训练的比特翻转概率,结合自适应功率调制策略,在任务性能和功耗方面均优于现有方法。
English Summary: This paper introduces a novel framework for training channel-adaptive digital semantic communication systems that uses parallel binary symmetric channels as a universal representation, with trainable bit-flip probabilities and an inference-stage strategy that adaptively selects power and modulation levels, outperforming existing training approaches in task performance and power consumption.
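
As a concrete picture of the channel model named in the abstract above, the following minimal sketch simulates a binary symmetric channel (BSC) and a bank of parallel BSCs with one bit-flip probability per bit position. It covers only the forward simulation; how the paper makes the flip probabilities trainable is not shown, and the function names are hypothetical.

```python
import random


def binary_symmetric_channel(bits, flip_prob, rng):
    """Flip each bit independently with probability flip_prob."""
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must lie in [0, 1]")
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


def parallel_bscs(bits, flip_probs, rng):
    """Apply one BSC per bit position, each with its own flip probability."""
    if len(bits) != len(flip_probs):
        raise ValueError("need exactly one flip probability per bit")
    return [binary_symmetric_channel([b], p, rng)[0]
            for b, p in zip(bits, flip_probs)]


rng = random.Random(0)
encoder_output = [0, 0, 1, 1]
# Positions with probability 0.0 are never flipped, positions with 1.0 always are.
decoder_input = parallel_bscs(encoder_output, [0.0, 1.0, 0.0, 1.0], rng)
```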

Authors:Masahiro Matsumoto, Abu Saleh Musa Miah, Nobuyoshi Asai, Jungpil Shin
Title: Machine Learning-Based Differential Diagnosis of Parkinson's Disease Using Kinematic Feature Extraction and Selection
Abstract:
Parkinson's disease (PD), the second most common neurodegenerative disorder, is characterized by dopaminergic neuron loss and the accumulation of abnormal synuclein. PD presents both motor and non-motor symptoms that progressively impair daily functioning. The severity of these symptoms is typically assessed using the MDS-UPDRS rating scale, which is subjective and dependent on the physician's experience. Additionally, PD shares symptoms with other neurodegenerative diseases, such as progressive supranuclear palsy (PSP) and multiple system atrophy (MSA), complicating accurate diagnosis. To address these diagnostic challenges, we propose a machine learning-based system for differential diagnosis of PD, PSP, MSA, and healthy controls (HC). This system utilizes a kinematic feature-based hierarchical feature extraction and selection approach. Initially, 18 kinematic features are extracted, including two newly proposed features: Thumb-to-index vector velocity and acceleration, which provide insights into motor control patterns. In addition, 41 statistical features are extracted from each kinematic feature, including some newly introduced ones such as Average Absolute Change, Rhythm, Amplitude, Frequency, Standard Deviation of Frequency, and Slope. Feature selection is performed using One-way ANOVA to rank features, followed by Sequential Forward Floating Selection (SFFS) to identify the most relevant ones, aiming to reduce the computational complexity. The final feature set is used for classification, achieving a classification accuracy of 66.67% for each dataset and 88.89% for each patient, with particularly high performance for the MSA and HC groups using the SVM algorithm. This system shows potential as a rapid and accurate diagnostic tool in clinical practice, though further data collection and refinement are needed to enhance its reliability.
中文: 本研究提出一种基于运动学和统计特征的机器学习系统,用于区分帕金森病与其他类似神经退行性疾病,对每位患者的分类准确率最高达88.89%,虽需进一步完善,但展现了作为临床诊断工具的潜力。
English: This study proposes a machine learning system using kinematic and statistical features to differentiate Parkinson's disease from similar neurodegenerative disorders, achieving up to 88.89% classification accuracy per patient and demonstrating potential as a clinical diagnostic tool despite needing further refinement.
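
The ranking step mentioned in the abstract above (one-way ANOVA applied per feature) can be sketched with the textbook F statistic, the ratio of between-group to within-group mean squares. This is a generic illustration, not the authors' pipeline: the feature names and values are made up, and the subsequent SFFS stage is not shown.

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA for one feature across class groups.

    Assumes at least two groups and non-zero within-group variance.
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def rank_features(feature_groups):
    """Sort feature names by descending F statistic."""
    scores = {name: one_way_anova_f(groups)
              for name, groups in feature_groups.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: each feature holds one list of values per diagnostic class.
features = {
    "velocity": [[1, 2, 3], [2, 3, 4], [5, 6, 7]],
    "noise": [[1, 5, 3], [2, 4, 3], [3, 1, 5]],
}
ranking = rank_features(features)
```

A feature whose class means are well separated relative to its within-class spread gets a large F value and is ranked first.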

Authors:Nikolaos Stathoulopoulos, Christoforos Kanellakis, George Nikolakopoulos
Title: Balancing Accuracy and Efficiency for Large-Scale SLAM: A Minimal Subset Approach for Scalable Loop Closures
Abstract:
Typical LiDAR SLAM architectures feature a front-end for odometry estimation and a back-end for refining and optimizing the trajectory and map, commonly through loop closures. However, loop closure detection in large-scale missions presents significant computational challenges due to the need to identify, verify, and process numerous candidate pairs for pose graph optimization. Keyframe sampling bridges the front-end and back-end by selecting frames for storing and processing during global optimization. This article proposes an online keyframe sampling approach that constructs the pose graph using the most impactful keyframes for loop closure. We introduce the Minimal Subset Approach (MSA), which optimizes two key objectives: redundancy minimization and information preservation, implemented within a sliding window framework. By operating in the feature space rather than 3-D space, MSA efficiently reduces redundant keyframes while retaining essential information. In sum, evaluations on diverse public datasets show that the proposed approach outperforms naive methods in reducing false positive rates in place recognition, while delivering superior ATE and RPE in metric localization, without the need for manual parameter tuning. Additionally, MSA demonstrates efficiency and scalability by reducing memory usage and computational overhead during loop closure detection and pose graph optimization.
中文摘要:本文提出了一种在线关键帧采样方法——最小子集方法(MSA),通过在特征空间中优化冗余最小化和信息保留,有效减少闭环检测中的冗余关键帧,在多个数据集上实现了更高的定位精度和更低的计算开销。
English Summary: This article introduces an online keyframe sampling method called the Minimal Subset Approach (MSA) that efficiently reduces redundant keyframes while preserving essential information, demonstrating superior performance in place recognition and metric localization across various datasets with reduced computational overhead.

Authors:Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian
Title: AR4D: Autoregressive 4D Generation from Monocular Videos
Abstract:
Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
中文: AR4D框架提出了一种无需SDS的动态3D内容生成新范式,通过自回归生成和渐进式视角采样技术,实现了更优的多样性、时空一致性和文本对齐效果。
English: The AR4D framework introduces an SDS-free paradigm for dynamic 3D content creation, employing autoregressive generation and progressive view sampling to achieve superior diversity, consistency, and prompt alignment.

Authors:Ali Rabeh, Ethan Herron, Aditya Balu, Soumik Sarkar, Chinmay Hegde, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: Geometry Matters: Benchmarking Scientific ML Approaches for Flow Prediction around Complex Geometries
Abstract:
Rapid and accurate simulations of fluid dynamics around complicated geometric bodies are critical in a variety of engineering and scientific applications, including aerodynamics and biomedical flows. However, while scientific machine learning (SciML) has shown considerable promise, most studies in this field are limited to simple geometries, and complex, real-world scenarios are underexplored. This paper addresses this gap by benchmarking diverse SciML models, including neural operators and vision transformer-based foundation models, for fluid flow prediction over intricate geometries. Using a high-fidelity dataset of steady-state flows across various geometries, we evaluate the impact of geometric representations -- Signed Distance Fields (SDF) and binary masks -- on model accuracy, scalability, and generalization. Central to this effort is the introduction of a novel, unified scoring framework that integrates metrics for global accuracy, boundary layer fidelity, and physical consistency to enable a robust, comparative evaluation of model performance. Our findings demonstrate that newer foundation models significantly outperform neural operators, particularly in data-limited scenarios, and that SDF representations yield superior results with sufficient training data. Despite these promises, all models struggle with out-of-distribution generalization, highlighting a critical challenge for future SciML applications. By advancing both evaluation models and modeling capabilities, our work paves the way for robust and scalable ML solutions for fluid dynamics across complex geometries.
中文: 本文针对复杂几何体上的流体流动预测,对多种科学机器学习模型进行了基准测试,发现基础模型优于神经算子且符号距离场表示能提高精度,但所有模型在分布外泛化方面仍面临挑战。
English: This paper benchmarks various scientific machine learning models for fluid flow prediction over complex geometries, finding that foundation models outperform neural operators and that signed distance field representations enhance accuracy, though all models face challenges with out-of-distribution generalization.

Authors:Yixu Wang, Tianle Gu, Yan Teng, Yingchun Wang, Xingjun Ma
Title: HoneypotNet: Backdoor Attacks Against Model Extraction
Abstract:
Model extraction attacks are a type of inference-time attack that approximates the functionality and performance of a black-box victim model by launching a certain number of queries to the model and then leveraging the model's predictions to train a substitute model. These attacks pose severe security threats to production models and MLaaS platforms and could cause significant monetary losses to the model owners. A body of work has proposed to defend machine learning models against model extraction attacks, including both active defense methods that modify the model's outputs or increase the query overhead to avoid extraction and passive defense methods that detect malicious queries or leverage watermarks to perform post-verification. In this work, we introduce a new defense paradigm called attack as defense, which modifies the model's output to be poisonous such that any malicious users that attempt to use the output to train a substitute model will be poisoned. To this end, we propose a novel lightweight backdoor attack method dubbed HoneypotNet that replaces the classification layer of the victim model with a honeypot layer and then fine-tunes the honeypot layer with a shadow model (to simulate model extraction) via bi-level optimization to modify its output to be poisonous while retaining the original performance. We empirically demonstrate on four commonly used benchmark datasets that HoneypotNet can inject backdoors into substitute models with a high success rate. The injected backdoor not only facilitates ownership verification but also disrupts the functionality of substitute models, serving as a significant deterrent to model extraction attacks.
中文: 模型提取攻击通过查询训练替代模型威胁机器学习安全,而提出的HoneypotNet防御方法能在攻击过程中注入后门,既保持原模型性能又破坏替代模型功能。
English: Model extraction attacks threaten machine learning models by training substitute models through queries, but the proposed HoneypotNet defense injects backdoors during such attacks to poison substitute models while preserving original performance.

Authors:Yifei Zhou, Thomas Kämpfe, Kai Ni, Hussam Amrouch, Cheng Zhuo, Xunzhao Yin
Title: TReCiM: Lower Power and Temperature-Resilient Multibit 2FeFET-1T Compute-in-Memory Design
Abstract:
Compute-in-memory (CiM) emerges as a promising solution to solve hardware challenges in artificial intelligence (AI) and the Internet of Things (IoT), particularly addressing the "memory wall" issue. By utilizing nonvolatile memory (NVM) devices in a crossbar structure, CiM efficiently accelerates multiply-accumulate (MAC) computations, the crucial operations in neural networks and other AI models. Among various NVM devices, Ferroelectric FET (FeFET) is particularly appealing for ultra-low-power CiM arrays due to its CMOS compatibility, voltage-driven write/read mechanisms and high ION/IOFF ratio. Moreover, subthreshold-operated FeFETs, which operate at scaling voltages in the subthreshold region, can further minimize the power consumption of the CiM array. However, subthreshold-FeFETs are susceptible to temperature drift, resulting in computation accuracy degradation. Existing solutions exhibit weak temperature resilience at larger array sizes and support only 1-bit operation. In this paper, we propose TReCiM, an ultra-low-power temperature-resilient multibit 2FeFET-1T CiM design that reliably performs MAC operations in the subthreshold-FeFET region with temperature ranging from 0 to 85 degrees Celsius at scale. We benchmark our design using the NeuroSim framework in the context of the VGG-8 neural network architecture running the CIFAR-10 dataset. Benchmarking results suggest that when considering temperature drift impact, our proposed TReCiM array achieves 91.31% accuracy, with 1.86% accuracy improvement compared to the existing 1-bit 2T-1FeFET CiM array. Furthermore, our proposed design achieves 48.03 TOPS/W energy efficiency at system level, comparable to existing designs with smaller technology feature sizes.
中文: 采用FeFET技术的存内计算(CiM)为解决AI和IoT中的存储墙问题提供了前景广阔的方案,所提出的TReCiM设计实现了超低功耗、温度自适应的多比特运算,在宽温范围内保持高精度和能效表现。
English: Compute-in-memory (CiM) using FeFET technology offers a promising solution to overcome the memory wall in AI and IoT applications, with the proposed TReCiM design enabling ultra-low-power, temperature-resilient multibit operations that achieve high accuracy and energy efficiency across varying temperatures.

Authors:Yulong Li, Yuxuan Zhang, Feilong Tang, Ming Hu, Zhixiang Lu, Haochen Xue, Jianghao Wu, Mian Zhou, Kang Dang, Chong Li, Yifang Wang, Imran Razzak, Jionglong Su
Title: Beyond Words: AuralLLM and SignMST-C for Sign Language Production and Bidirectional Accessibility
Abstract:
Sign language is the primary communication mode for 72 million hearing-impaired individuals worldwide, necessitating effective bidirectional Sign Language Production and Sign Language Translation systems. However, functional bidirectional systems require a unified linguistic environment, hindered by the lack of suitable unified datasets, particularly those providing the necessary pose information for accurate Sign Language Production (SLP) evaluation. Concurrently, current SLP evaluation methods like back-translation ignore pose accuracy, and high-quality coordinated generation remains challenging. To create this crucial environment and overcome these challenges, we introduce CNText2Sign and CNSign, which together constitute the first unified dataset aimed at supporting bidirectional accessibility systems for Chinese sign language; CNText2Sign provides 15,000 natural language-to-sign mappings and standardized skeletal keypoints for 8,643 vocabulary items supporting pose assessment. Building upon this foundation, we propose the AuraLLM model, which leverages a decoupled architecture with CNText2Sign's pose data for novel direct gesture accuracy assessment. The model employs retrieval augmentation and Cascading Vocabulary Resolution to handle semantic mapping and out-of-vocabulary words and achieves all-scenario production with controllable coordination of gestures and facial expressions via pose-conditioned video synthesis. Concurrently, our Sign Language Translation model SignMST-C employs targeted self-supervised pretraining for dynamic feature capture, achieving new SOTA results on PHOENIX2014-T with BLEU-4 scores up to 32.08. AuraLLM establishes a strong performance baseline on CNText2Sign with a BLEU-4 score of 50.41 under direct evaluation.
中文摘要:本研究针对手语系统缺乏统一数据集和姿势评估的问题,推出了首个中文双向手语统一数据集及AuraLLM模型,通过解耦架构实现精准手势评估,并在翻译任务中取得最优性能。
English Summary: This research introduces the first unified dataset and AuraLLM model for bidirectional Chinese sign language systems, addressing the lack of pose-accurate evaluation methods and achieving state-of-the-art translation performance with novel gesture assessment capabilities.

Authors:Cheng-Hau Yang, Guglielmo Scovazzi, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Title: A Shifted Boundary Method for Thermal Flows
Abstract:
This paper presents an incomplete Octree mesh implementation of the Shifted Boundary Method (Octree-SBM) for multiphysics simulations of coupled flow and heat transfer. Specifically, a semi-implicit formulation of the thermal Navier-Stokes equations is used to accelerate the simulations while maintaining accuracy. The SBM enables precise enforcement of field and derivative boundary conditions on cut (intercepted) elements, allowing for accurate flux calculations near complex geometries, when using non-boundary fitted meshes. Both Dirichlet and Neumann boundary conditions are implemented within the SBM framework, with results demonstrating that the SBM ensures precise enforcement of Neumann boundary conditions on Octree-based meshes. We illustrate this approach by simulating flows across different regimes, spanning several orders of magnitude in both the Rayleigh number ($Ra \sim 10^3$--$10^9$) and the Reynolds number ($Re \sim 10^0$--$10^4$), and covering the laminar, transitional, and turbulent flow regimes. Coupled thermal-flow phenomena and their statistics across all these regimes are accurately captured without any additional numerical treatments, beyond a Residual-based Variational Multiscale formulation (RB-VMS). This approach offers a reliable and efficient solution for complex geometries, boundary conditions and flow regimes in computational multiphysics simulations.
中文: 本文提出了一种基于不完全八叉树网格的移位边界方法,用于流热耦合多物理场模拟,能够在复杂几何和边界条件下精确实施边界条件并计算通量,无需额外数值处理即可准确捕捉多种流态下的热流耦合现象。
English: This paper introduces an incomplete Octree mesh implementation of the Shifted Boundary Method for efficient multiphysics simulations of coupled flow and heat transfer, enabling accurate boundary condition enforcement and flux calculations across diverse flow regimes without additional numerical treatments.

Authors:Antoine Scheid, Etienne Boursier, Alain Durmus, Eric Moulines, Michael Jordan
Title: Online Decision-Making in Tree-Like Multi-Agent Games with Transfers
Abstract:
The widespread deployment of Machine Learning systems raises challenges, such as dealing with interactions or competition between multiple learners. To that end, we study multi-agent sequential decision-making by considering principal-agent interactions in a tree structure. In this problem, the reward of a player is influenced by the actions of her children, who are all self-interested and non-cooperative, hence the complexity of making good decisions. Our main finding is that it is possible to steer all the players towards the globally optimal set of actions by simply allowing single-step transfers between them. A transfer is established between a principal and one of her agents: the principal actually offers the proposed payment if the agent picks the recommended action. The analysis poses specific challenges due to the intricate interactions between the nodes of the tree and the propagation of the regret within this tree. Considering a bandit setup, we propose algorithmic solutions for the players to end up being no-regret with respect to the optimal pair of actions and incentives. In the long run, allowing transfers between players makes them act as if they were collaborating, although they remain self-interested and non-cooperative: transfers restore efficiency.
中文摘要:研究表明,在多智能体决策中,自私非合作的参与者之间通过单步转移支付即可引导其实现全局最优行动,从而有效恢复系统效率。
English Summary: The study demonstrates that single-step transfers between self-interested players in multi-agent decision-making can steer them toward globally optimal actions, effectively restoring efficiency despite their non-cooperative nature.

Authors:Ying Zang, Runlong Cao, Jianqi Zhang, Yidong Han, Ziyue Cao, Wenjun Hu, Didi Zhu, Lanyun Zhu, Zejian Li, Deyi Ji, Tianrun Chen
Title: Let Human Sketches Help: Empowering Challenging Image Segmentation Task with Freehand Sketches
Abstract:
Sketches, with their expressive potential, allow humans to convey the essence of an object through even a rough contour. For the first time, we harness this expressive potential to improve segmentation performance in challenging tasks like camouflaged object detection (COD). Our approach introduces an innovative sketch-guided interactive segmentation framework, allowing users to intuitively annotate objects with freehand sketches (drawing a rough contour of the object) instead of the traditional bounding boxes or points used in classic interactive segmentation models like SAM. We demonstrate that sketch input can significantly improve performance in existing iterative segmentation methods, outperforming text or bounding box annotations. Additionally, we introduce key modifications to network architectures and a novel sketch augmentation technique to fully harness the power of sketch input and further boost segmentation accuracy. Remarkably, our model's output can be directly used to train other neural networks, achieving results comparable to pixel-by-pixel annotations--while reducing annotation time by up to 120 times, which shows great potential in democratizing the annotation process and enabling model training with less reliance on resource-intensive, laborious pixel-level annotations. We also present KOSCamo+, the first freehand sketch dataset for camouflaged object detection. The dataset, code, and the labeling tool will be open sourced.
中文摘要:本研究提出了一种草图引导的交互式分割框架,利用手绘草图显著提升伪装目标检测性能,在达到像素级标注精度的同时将标注时间减少高达120倍。
English Summary: This study introduces a sketch-guided interactive segmentation framework that uses freehand sketches to significantly enhance camouflaged object detection performance, reducing annotation time by up to 120 times while achieving accuracy comparable to pixel-level annotations.

Authors:Aymeric Capitaine, Etienne Boursier, Eric Moulines, Michael I. Jordan, Alain Durmus
Title: Prediction-Aware Learning in Multi-Agent Systems
Abstract:
The framework of uncoupled online learning in multiplayer games has made significant progress in recent years. In particular, the development of time-varying games has considerably expanded its modeling capabilities. However, current regret bounds quickly become vacuous when the game undergoes significant variations over time, even when these variations are easy to predict. Intuitively, the ability of players to forecast future payoffs should lead to tighter guarantees, yet existing approaches fail to incorporate this aspect. This work aims to fill this gap by introducing a novel prediction-aware framework for time-varying games, where agents can forecast future payoffs and adapt their strategies accordingly. In this framework, payoffs depend on an underlying state of nature that agents predict in an online manner. To leverage these predictions, we propose the POWMU algorithm, a contextual extension of the optimistic Multiplicative Weight Update algorithm, for which we establish theoretical guarantees on social welfare and convergence to equilibrium. Our results demonstrate that, under bounded prediction errors, the proposed framework achieves performance comparable to the static setting. Finally, we empirically demonstrate the effectiveness of POWMU in a traffic routing experiment.
中文摘要:本文提出了一种针对时变博弈的预测感知框架,通过POWMU算法使智能体能够预测未来收益并在有界预测误差下实现接近静态环境的性能,同时提供了理论保证并在交通路由实验中进行了实证验证。
English Summary: This paper introduces a prediction-aware framework for time-varying games, where agents forecast future payoffs using the POWMU algorithm to achieve near-static performance under bounded prediction errors, with theoretical guarantees and empirical validation in traffic routing.

Authors:Ching-Chun Chang, Fan-Yun Chen, Shih-Hong Gu, Kai Gao, Hanrui Wang, Isao Echizen
Title: Imitation Game for Adversarial Disillusion with Multimodal Generative Chain-of-Thought Role-Play
Abstract:
As the cornerstone of artificial intelligence, machine perception confronts a fundamental threat posed by adversarial illusions. These adversarial attacks manifest in two primary forms: deductive illusion, where specific stimuli are crafted based on the victim model's general decision logic, and inductive illusion, where the victim model's general decision logic is shaped by specific stimuli. The former exploits the model's decision boundaries to create a stimulus that, when applied, interferes with its decision-making process. The latter reinforces a conditioned reflex in the model, embedding a backdoor during its learning phase that, when triggered by a stimulus, causes aberrant behaviours. The multifaceted nature of adversarial illusions calls for a unified defence framework, addressing vulnerabilities across various forms of attack. In this study, we propose a disillusion paradigm based on the concept of an imitation game. At the heart of the imitation game lies a multimodal generative agent, steered by chain-of-thought reasoning, which observes, internalises and reconstructs the semantic essence of a sample, liberated from the classic pursuit of reversing the sample to its original state. As a proof of concept, we conduct experimental simulations using a multimodal generative dialogue agent and evaluate the methodology under a variety of attack scenarios.
中文摘要:本研究针对机器感知中的对抗性幻觉威胁,提出基于模仿游戏概念的去幻觉范式,通过多模态生成代理在思维链引导下观察、内化并重构样本语义本质,而非传统还原样本原状的方式,构建统一防御框架。
English Summary: This study introduces a unified defense framework against adversarial illusions in machine perception, proposing a disillusion paradigm using a multimodal generative agent guided by chain-of-thought reasoning to reconstruct samples' semantic essence without reverting to their original state.

Authors:Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu
Title: GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
Abstract:
Generating full-body human gestures from speech signals remains challenging in terms of quality and speed. Existing approaches model different body regions such as body, legs and hands separately, which fails to capture the spatial interactions between them and results in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining the spatial-temporal modeling and improved flow matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project Page: https://andypinxinliu.github.io/GestureLSM
中文摘要:GestureLSM采用基于流匹配的时空建模方法,通过显式建模身体区域间的交互和优化潜在速度空间,实现了语音驱动全身手势的高质量生成与快速推理。
English Summary: GestureLSM introduces a flow-matching approach with spatial-temporal modeling to generate synchronized full-body gestures from speech, achieving superior quality and faster inference by capturing inter-region interactions and optimizing latent velocity space.

Authors:Minh Nhat Vu, Florian Grander, Anh Nguyen
Title: Online Trajectory Replanner for Dynamically Grasping Irregular Objects
Abstract:
This paper presents a new trajectory replanner for grasping irregular objects. Unlike conventional grasping tasks where the object's geometry is assumed simple, we aim to achieve a "dynamic grasp" of the irregular objects, which requires continuous adjustment during the grasping process. To effectively handle irregular objects, we propose a trajectory optimization framework that comprises two phases. Firstly, in a specified time limit of 10s, initial offline trajectories are computed for a seamless motion from an initial configuration of the robot to grasp the object and deliver it to a pre-defined target location. Secondly, fast online trajectory optimization is implemented to update robot trajectories in real-time within 100 ms. This helps to mitigate pose estimation errors from the vision system. To account for model inaccuracies, disturbances, and other non-modeled effects, trajectory tracking controllers for both the robot and the gripper are implemented to execute the optimal trajectories from the proposed framework. The intensive experimental results effectively demonstrate the performance of our trajectory planning framework in both simulation and real-world scenarios.
中文: 本文提出了一种双阶段轨迹优化框架,通过离线规划生成初始抓取轨迹,并采用在线实时优化来修正视觉误差,实现对不规则物体的动态精准抓取。
English: This paper introduces a dual-phase trajectory optimization framework for dynamically grasping irregular objects, featuring offline planning for initial motion and real-time online adjustments to correct errors and ensure precise execution.

Authors:Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, Jun Liu
Title: Not Every Patch is Needed: Towards a More Efficient and Effective Backbone for Video-based Person Re-identification
Abstract:
This paper proposes a new effective and efficient plug-and-play backbone for video-based person re-identification (ReID). Conventional video-based ReID methods typically use CNN or transformer backbones to extract deep features for every position in every sampled video frame. Here, we argue that this exhaustive feature extraction could be unnecessary, since we find that different frames in a ReID video often exhibit small differences and contain many similar regions due to the relatively slight movements of human beings. Inspired by this, a more selective, efficient paradigm is explored in this paper. Specifically, we introduce a patch selection mechanism to reduce computational cost by choosing only the crucial and non-repetitive patches for feature extraction. Additionally, we present a novel network structure that generates and utilizes pseudo frame global context to address the issue of incomplete views resulting from sparse inputs. By incorporating these new designs, our backbone can achieve both high performance and low computational cost. Extensive experiments on multiple datasets show that our approach reduces the computational cost by 74% compared to ViT-B and 28% compared to ResNet50, while the accuracy is on par with ViT-B and outperforms ResNet50 significantly.
Chinese: 本文提出了一种高效的即插即用视频行人重识别骨干网络,通过选择性地提取关键图像块特征并利用伪帧全局上下文,在保持与ViT-B相当精度的同时将计算成本降低了74%。
English: This paper introduces an efficient plug-and-play backbone for video-based person re-identification that reduces computational costs by selectively extracting features from crucial patches and using pseudo frame global context, achieving comparable accuracy to ViT-B with 74% less computation.

Authors:Meiyun Cao, Shaw Hu, Jason Sharp, Edward Clouser, Jason Holmes, Linda L. Lam, Xiaoning Ding, Diego Santos Toesca, Wendy S. Lindholm, Samir H. Patel, Sujay A. Vora, Peilong Wang, Wei Liu
Title: Evaluating The Performance of Using Large Language Models to Automate Summarization of CT Simulation Orders in Radiation Oncology
Abstract:
Purpose: This study aims to use a large language model (LLM) to automate the generation of summaries from the CT simulation orders and evaluate its performance. Materials and Methods: A total of 607 CT simulation orders for patients were collected from the Aria database at our institution. A locally hosted Llama 3.1 405B model, accessed via the Application Programming Interface (API) service, was used to extract keywords from the CT simulation orders and generate summaries. The downloaded CT simulation orders were categorized into seven groups based on treatment modalities and disease sites. For each group, a customized instruction prompt was developed collaboratively with therapists to guide the Llama 3.1 405B model in generating summaries. The ground truth for the corresponding summaries was manually derived by carefully reviewing each CT simulation order and subsequently verified by therapists. The accuracy of the LLM-generated summaries was evaluated by therapists using the verified ground truth as a reference. Results: About 98% of the LLM-generated summaries aligned with the manually generated ground truth in terms of accuracy. Our evaluations showed improved consistency in format and enhanced readability of the LLM-generated summaries compared to the corresponding therapist-generated summaries. This automated approach demonstrated consistent performance across all groups, regardless of modality or disease site. Conclusions: This study demonstrated the high precision and consistency of the Llama 3.1 405B model in extracting keywords and summarizing CT simulation orders, suggesting that LLMs have great potential to help with this task, reduce the workload of therapists and improve workflow efficiency.
中文: 本研究证明Llama 3.1 405B模型在自动生成CT模拟定位单摘要方面达到98%的准确率,显示出减轻治疗师工作负担和提升工作流程效率的巨大潜力。
English: This study demonstrates that the Llama 3.1 405B model achieves 98% accuracy in automatically summarizing CT simulation orders, showing strong potential to reduce therapist workload and improve workflow efficiency.

Authors:Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Abstract:
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.
中文摘要:Emilia-Pipe是一个开源预处理流程,能从真实语境中提取高质量语音数据构建大规模多语言数据集,使训练出的语音生成模型比基于传统朗读式数据训练的模型更具自然度和表现力。
English Summary: Emilia-Pipe is an open-source pipeline that creates large-scale, multilingual speech datasets from spontaneous real-world sources, enabling models to generate more natural and varied human-like speech than those trained on traditional audiobook data.

Authors:Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Abstract:
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments show that Emilia-trained models produce markedly more spontaneous, human-like speech than those trained on traditional audio-book datasets, while matching their intelligibility. These models better capture diverse speaker timbres and the full spectrum of real-world conversational styles. Our work also highlights the importance of scaling dataset size for advancing speech generation performance and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation tasks.
中文摘要:Emilia-Pipe是一个开源预处理流程,能从真实语境中提取高质量语音数据构建大规模多语言数据集,使训练出的语音生成模型比基于传统朗读式数据训练的模型更具自然度和表现力。
English Summary: Emilia-Pipe is an open-source pipeline that creates large-scale, multilingual speech datasets from spontaneous real-world sources, enabling models to generate more natural and varied human-like speech than those trained on traditional audiobook data.

Authors:Shavbo Salehi, Pedro Enrique Iturria-Rivera, Medhat Elsayed, Majid Bavand, Raimundas Gaigalas, Yigit Ozcan, Melike Erol-Kantarci
Title: Prioritized Value-Decomposition Network for Explainable AI-Enabled Network Slicing
Abstract:
Network slicing aims to enhance flexibility and efficiency in next-generation wireless networks by allocating the right resources to meet the diverse requirements of various applications. Managing these slices with machine learning (ML) algorithms has emerged as a promising approach; however, explainability has been a challenge. To this end, several Explainable Artificial Intelligence (XAI) frameworks have been proposed to address the opacity in decision-making in many ML methods. In this paper, we propose a Prioritized Value-Decomposition Network (PVDN) as an XAI-driven approach for resource allocation in a multi-agent network slicing system. The PVDN method decomposes the global value function into individual contributions and prioritizes slice outputs, providing an explanation of how resource allocation decisions impact system performance. By incorporating XAI, PVDN offers valuable insights into the decision-making process, enabling network operators to better understand, trust, and optimize slice management strategies. Through simulations, we demonstrate the effectiveness of the PVDN approach, which improves throughput by 67% and 16% and reduces latency by 35% and 22% compared to independent and VDN-based resource allocation methods, respectively.
Chinese: 本文提出了一种优先价值分解网络(PVDN)作为网络切片中资源分配的可解释人工智能方法,在提高吞吐量达67%、降低延迟达35%的同时,为决策过程提供了透明化的解读。
English: This paper introduces a Prioritized Value-Decomposition Network (PVDN) as an explainable AI approach for resource allocation in network slicing, which enhances throughput by up to 67% and reduces latency by up to 35% while providing transparent decision-making insights.

Authors:Gongning Luo, Mingwang Xu, Hongyu Chen, Xinjie Liang, Xing Tao, Dong Ni, Hyunsu Jeong, Chulhong Kim, Raphael Stock, Michael Baumgartner, Yannick Kirchhoff, Maximilian Rokuss, Klaus Maier-Hein, Zhikai Yang, Tianyu Fan, Nicolas Boutry, Dmitry Tereshchenko, Arthur Moine, Maximilien Charmetant, Jan Sauer, Hao Du, Xiang-Hui Bai, Vipul Pai Raikar, Ricardo Montoya-del-Angel, Robert Marti, Miguel Luna, Dongmin Lee, Abdul Qayyum, Moona Mazher, Qihui Guo, Changyan Wang, Navchetan Awasthi, Qiaochu Zhao, Wei Wang, Kuanquan Wang, Qiucheng Wang, Suyu Dong
Title: Tumor Detection, Segmentation and Classification Challenge on Automated 3D Breast Ultrasound: The TDSC-ABUS Challenge
Abstract:
Breast cancer is one of the most common causes of death among women worldwide. Early detection helps in reducing the number of deaths. Automated 3D Breast Ultrasound (ABUS) is a newer approach for breast screening, which has many advantages over handheld mammography such as safety, speed, and higher detection rate of breast cancer. Tumor detection, segmentation, and classification are key components in the analysis of medical images, especially challenging in the context of 3D ABUS due to the significant variability in tumor size and shape, unclear tumor boundaries, and a low signal-to-noise ratio. The lack of publicly accessible, well-labeled ABUS datasets further hinders the advancement of systems for breast tumor analysis. Addressing this gap, we have organized the inaugural Tumor Detection, Segmentation, and Classification Challenge on Automated 3D Breast Ultrasound 2023 (TDSC-ABUS2023). This initiative aims to spearhead research in this field and create a definitive benchmark for tasks associated with 3D ABUS image analysis. In this paper, we summarize the top-performing algorithms from the challenge and provide critical analysis for ABUS image examination. We offer the TDSC-ABUS challenge as an open-access platform at https://tdsc-abus2023.grand-challenge.org/ to benchmark and inspire future developments in algorithmic research.
中文:TDSC-ABUS2023挑战赛通过建立公开基准平台推进三维自动乳腺超声的肿瘤检测、分割与分类算法研究,本文总结了最优算法性能并为后续研究提供开放访问资源。
English: The TDSC-ABUS2023 challenge addresses the limitations in 3D Automated Breast Ultrasound analysis by establishing a public benchmark to advance tumor detection, segmentation, and classification algorithms, with this paper summarizing top-performing methods and providing an open-access platform for ongoing research.

Authors:Junan Zhang, Jing Yang, Zihao Fang, Yuancheng Wang, Zehua Zhang, Zhuo Wang, Fan Fan, Zhizheng Wu
Title: AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
Abstract:
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at https://amphionspace.github.io/anyenhance/.
中文:AnyEnhance是一个统一的生成模型,通过提示引导的上下文学习和迭代自优化机制,能够同时处理语音和歌声的多种增强任务,在各项测试中均优于现有方法。
English: AnyEnhance is a unified generative model that simultaneously handles multiple voice enhancement tasks for both speech and singing voices through prompt-guided in-context learning and iterative self-refinement, outperforming existing methods across various benchmarks.

Authors:Jieming Cao, Chen Huang, Yanan Zhang, Ruibo Deng, Jincheng Zhang, Wenqiang Lei
Title: Breaking the Stigma! Unobtrusively Probe Symptoms in Depression Disorder Diagnosis Dialogue
Abstract:
Stigma has emerged as one of the major obstacles to effectively diagnosing depression, as it prevents users from open conversations about their struggles. This requires advanced questioning skills to carefully probe the presence of specific symptoms in an unobtrusive manner. While recent efforts have been made on depression-diagnosis-oriented dialogue systems, they largely ignore this problem, ultimately hampering their practical utility. To this end, we propose a novel and effective method, UPSD⁴, developing a series of strategies to promote a sense of unobtrusiveness within the dialogue system and assessing depression disorder by probing symptoms. We experimentally show that UPSD⁴ demonstrates a significant improvement over current baselines, including unobtrusiveness evaluation of dialogue content and diagnostic accuracy. We believe our work contributes to developing more accessible and user-friendly tools for addressing the widespread need for depression diagnosis.
中文: 提出的UPSD⁴方法通过引入非侵入式提问策略来克服抑郁症诊断中的污名化障碍,显著提升了对话系统的隐蔽性和诊断准确性。
English: The proposed UPSD⁴ method enhances dialogue systems by incorporating unobtrusive questioning strategies to overcome stigma barriers in depression diagnosis, significantly improving both conversational subtlety and diagnostic accuracy.

Authors:Zhongyi Shui, Jianpeng Zhang, Weiwei Cao, Sinuo Wang, Ruizhe Guo, Le Lu, Lin Yang, Xianghua Ye, Tingbo Liang, Qi Zhang, Ling Zhang
Title: Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding
Abstract:
Artificial intelligence (AI) shows great potential in assisting radiologists to improve the efficiency and accuracy of medical image interpretation and diagnosis. However, a versatile AI model requires large-scale data and comprehensive annotations, which are often impractical in medical settings. Recent studies leverage radiology reports as a naturally high-quality supervision for medical images, using contrastive language-image pre-training (CLIP) to develop language-informed models for radiological image interpretation. Nonetheless, these approaches typically contrast entire images with reports, neglecting the local associations between imaging regions and report sentences, which may undermine model performance and interoperability. In this paper, we propose a fine-grained vision-language model (fVLM) for anatomy-level CT image interpretation. Specifically, we explicitly match anatomical regions of CT images with corresponding descriptions in radiology reports and perform contrastive pre-training for each anatomy individually. Fine-grained alignment, however, faces considerable false-negative challenges, mainly from the abundance of anatomy-level healthy samples and similarly diseased abnormalities. To tackle this issue, we propose identifying false negatives of both normal and abnormal samples and calibrating contrastive learning from patient-level to disease-aware pairing. We curated the largest CT dataset to date, comprising imaging and report data from 69,086 patients, and conducted a comprehensive evaluation of 54 major and important disease diagnosis tasks across 15 main anatomies. Experimental results demonstrate the substantial potential of fVLM in versatile medical image interpretation. In the zero-shot classification task, we achieved an average AUC of 81.3% on 54 diagnosis tasks, surpassing CLIP and supervised methods by 12.9% and 8.0%, respectively.
中文: 本文提出了一种细粒度视觉语言模型(fVLM),通过将CT图像的解剖区域与放射学报告中的对应描述进行对齐,并采用校准对比学习解决假阴性问题,在54种疾病诊断任务中实现了零样本分类的卓越性能。
English: This paper introduces a fine-grained vision-language model (fVLM) that aligns anatomical regions in CT images with corresponding radiology report sentences, addressing false-negative challenges through calibrated contrastive learning and achieving superior performance in zero-shot disease diagnosis across 54 tasks.

Authors:Hongyu Chen, Min Zhou, Jing Jiang, Jiale Chen, Yang Lu, Zihang Lin, Bo Xiao, Tiezheng Ge, Bo Zheng
Title: T-Stars-Poster: A Framework for Product-Centric Advertising Image Design
Abstract:
Creating advertising images is often a labor-intensive and time-consuming process. Can we automatically generate such images using basic product information like a product foreground image, taglines, and a target size? Existing methods mainly focus on parts of the problem and lack a comprehensive solution. To bridge this gap, we propose a novel product-centric framework for advertising image design called T-Stars-Poster. It consists of four sequential stages to highlight product foregrounds and taglines while achieving overall image aesthetics: prompt generation, layout generation, background image generation, and graphics rendering. Different expert models are designed and trained for the first three stages: First, a visual language model (VLM) generates background prompts that match the products. Next, a VLM-based layout generation model arranges the placement of product foregrounds, graphic elements (taglines and decorative underlays), and various nongraphic elements (objects from the background prompt). Following this, an SDXL-based model can simultaneously accept prompts, layouts, and foreground controls to generate images. To support T-Stars-Poster, we create two corresponding datasets with over 50,000 labeled images. Extensive experiments and online A/B tests demonstrate that T-Stars-Poster can produce more visually appealing advertising images.
中文:T-Stars-Poster框架通过提示生成、布局生成、背景图像生成和图形渲染四个阶段,采用专业模型自动生成突出产品前景和标语的高质量广告图像,实验和A/B测试验证了其视觉吸引力。
English: The proposed T-Stars-Poster framework automates advertising image creation through four sequential stages—prompt generation, layout generation, background image generation, and graphics rendering—using specialized models to enhance product visibility and aesthetic appeal, validated by experiments and A/B tests.

Authors:Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen
Title: GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
Abstract:
A critical requirement for deep learning models is ensuring their robustness against adversarial attacks. These attacks commonly introduce noticeable perturbations, compromising the visual fidelity of adversarial examples. Another key challenge is that while white-box algorithms can generate effective adversarial perturbations, they require access to the model gradients, limiting their practicality in many real-world scenarios. Existing attack mechanisms struggle to achieve similar efficacy without access to these gradients. In this paper, we introduce GreedyPixel, a novel pixel-wise greedy algorithm designed to generate high-quality adversarial examples using only query-based feedback from the target model. GreedyPixel improves computational efficiency in what is typically a brute-force process by perturbing individual pixels in sequence, guided by a pixel-wise priority map. This priority map is constructed by ranking gradients obtained from a surrogate model, providing a structured path for perturbation. Our results demonstrate that GreedyPixel achieves attack success rates comparable to white-box methods without the need for gradient information, and surpasses existing algorithms in black-box settings, offering higher success rates, reduced computational time, and imperceptible perturbations. These findings underscore the advantages of GreedyPixel in terms of attack efficacy, time efficiency, and visual quality.
中文摘要:GreedyPixel是一种新颖的逐像素贪婪算法,仅通过查询反馈即可高效生成高质量对抗样本,在无需梯度信息的情况下达到白盒攻击成功率,同时实现难以察觉的扰动和卓越的计算效率。
English Summary: GreedyPixel is a novel pixel-wise greedy algorithm that efficiently generates high-quality adversarial examples using only query-based feedback, achieving white-box level attack success rates without gradient access while ensuring imperceptible perturbations and superior computational efficiency.

Authors:Wailing Tang, Biqi Yang, Pheng-Ann Heng, Yun-Hui Liu, Chi-Wing Fu
Title: Overcoming Support Dilution for Robust Few-shot Semantic Segmentation
Abstract:
Few-shot Semantic Segmentation (FSS) is a challenging task that utilizes limited support images to segment associated unseen objects in query images. However, recent FSS methods are observed to perform worse when the number of shots is enlarged. As the support set enlarges, existing FSS networks struggle to concentrate on the high-contributed supports and could easily be overwhelmed by the low-contributed supports that could severely impair the mask predictions. In this work, we study this challenging issue, called support dilution. Our goal is to recognize, select, preserve, and enhance those high-contributed supports in the raw support pool. Technically, our method contains three novel parts. First, we propose a contribution index to quantitatively estimate whether a high-contributed support is diluted. Second, we develop the Symmetric Correlation (SC) module to preserve and enhance the high-contributed support features, minimizing the distraction by the low-contributed features. Third, we design the Support Image Pruning operation to retrieve a compact, high-quality subset by discarding low-contributed supports. We conduct extensive experiments on two FSS benchmarks, COCO-20i and PASCAL-5i; the segmentation results demonstrate the compelling performance of our solution over state-of-the-art FSS approaches. In addition, we apply our solution to online segmentation and real-world segmentation, where convincing segmentation results show the practical ability of our work for real-world demonstrations.
Chinese: 本研究针对少样本语义分割中的支持稀释问题,通过引入贡献度指标、对称相关模块和支持图像剪枝技术,有效增强高贡献支持并提升在基准测试中的分割精度。
English: This study addresses the challenge of support dilution in Few-shot Semantic Segmentation by introducing a contribution index, a Symmetric Correlation module, and Support Image Pruning to enhance high-contributed supports and improve segmentation accuracy on benchmarks.
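
As a rough illustration of the pruning idea in this entry, the sketch below scores each support with a stand-in contribution value and keeps only the top-scoring ones. The scoring rule (cosine similarity between a query feature and each support feature), the function names `contribution_scores` and `prune_supports`, and the toy vectors are all assumptions made for illustration; the abstract does not define the contribution index, so this is not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def contribution_scores(query_feat, support_feats):
    """Hypothetical stand-in for a contribution index: similarity to the query."""
    return [cosine(query_feat, s) for s in support_feats]

def prune_supports(query_feat, support_feats, keep):
    """Return indices of the `keep` highest-scoring supports, best first."""
    scores = contribution_scores(query_feat, support_feats)
    ranked = sorted(range(len(support_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]

# Toy 2-D features: supports 0 and 2 point roughly in the query's direction.
query = [1.0, 0.0]
supports = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
kept = prune_supports(query, supports, keep=2)
```

With these toy vectors the two supports most aligned with the query are retained and the orthogonal and opposite ones are discarded, which is the kind of compact subset the pruning step is described as producing.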

Authors:Alessio Bucaioni, Martin Weyssow, Junda He, Yunbo Lyu, David Lo
Title: A Functional Software Reference Architecture for LLM-Integrated Systems
Abstract:
The integration of large language models into software systems is transforming capabilities such as natural language understanding, decision-making, and autonomous task execution. However, the absence of a commonly accepted software reference architecture hinders systematic reasoning about their design and quality attributes. This gap makes it challenging to address critical concerns like privacy, security, modularity, and interoperability, which are increasingly important as these systems grow in complexity and societal impact. In this paper, we describe our emerging results for a preliminary functional reference architecture as a conceptual framework to address these challenges and guide the design, evaluation, and evolution of large language model-integrated systems. We identify key architectural concerns for these systems, informed by current research and practice. We then evaluate how the architecture addresses these concerns and validate its applicability using three open-source large language model-integrated systems in computer vision, text processing, and coding.
中文: 大型语言模型融入软件系统虽提升了功能,但缺乏统一参考架构,本文提出初步框架以解决设计和质量问题,并通过开源系统验证其适用性。
English: The integration of large language models into software systems enhances capabilities but lacks a standard reference architecture, prompting the proposal of a preliminary framework to address design and quality concerns, validated through open-source systems.

Authors:Jiayu Liu, Fuhui Zhou, Xiaodong Liu, Rui Ding, Lu Yuan, Qihui Wu
Title: Data-and-Semantic Dual-Driven Spectrum Map Construction for 6G Spectrum Management
Abstract:
Spectrum maps reflect the utilization and distribution of spectrum resources in the electromagnetic environment, serving as an effective approach to support spectrum management. However, the construction of spectrum maps in urban environments is challenging because of high-density connection and complex terrain. Moreover, the existing spectrum map construction methods are typically applied to a fixed frequency, which cannot cover the entire frequency band. To address the aforementioned challenges, a UNet-based data-and-semantic dual-driven method is proposed by introducing the semantic knowledge of binary city maps and binary sampling location maps to enhance the accuracy of spectrum map construction in complex urban environments with dense communications. Moreover, a joint frequency-space reasoning model is exploited to capture the correlation of spectrum data in terms of space and frequency, enabling the realization of complete spectrum map construction without sampling all frequencies of spectrum data. The simulation results demonstrate that the proposed method can infer the spectrum utilization status of missing frequencies and improve the completeness of the spectrum map construction. Furthermore, the accuracy of spectrum map construction achieved by the proposed data-and-semantic dual-driven method outperforms the benchmark schemes, especially in scenarios with low sampling density.
中文: 该方法通过引入语义知识与联合频空推理模型,在复杂城市环境中提升了频谱地图构建的精度,无需全频段采样即可实现完整构建,且在低采样密度场景下显著优于基准方案。
English: The proposed UNet-based data-and-semantic dual-driven method enhances spectrum map accuracy in complex urban environments by incorporating semantic knowledge and a joint frequency-space reasoning model, enabling complete construction without full-frequency sampling and outperforming benchmarks, especially under low sampling density.

Authors:Simone Göttlich, Benedikt Oppeneiger, Manuel Schaller, Karl Worthmann
Title: Spatial exponential decay of perturbations in optimal control of general evolution equations
Abstract:
We analyze the robustness of optimally controlled evolution equations with respect to spatially localized perturbations. We prove that if the involved operators are domain-uniformly stabilizable and detectable, then these localized perturbations only have a local effect on the optimal solution. We characterize this domain-uniform stabilizability and detectability for the transport equation with constant transport velocity, showing that even for unitary semigroups, optimality implies exponential damping. We extend this result to the case of a space-dependent transport velocity. Finally we leverage the results for the transport equation to characterize domain-uniform stabilizability of the wave equation. Numerical examples in one space dimension complement the theoretical results.
中文摘要:本研究证明,当算子满足区域一致可镇定和可检测条件时,空间局部扰动对最优控制演化方程仅产生局部影响,并通过一维数值算例验证了理论结果。
English Summary: This study demonstrates that optimally controlled evolution equations remain robust under spatially localized perturbations, showing such disturbances only have local effects when operators are domain-uniformly stabilizable and detectable, with numerical examples supporting the theoretical findings.

Authors:Yuanbin Chen, Xufeng Guo, Chau Yuen, Yufei Zhao, Yong Liang Guan, Chong Meng Samson See, Merouane Débbah, Lajos Hanzo
Title: Harnessing Rydberg Atomic Receivers: From Quantum Physics to Wireless Communications
Abstract:
The intrinsic integration of Rydberg atomic receivers into wireless communication systems is proposed, by harnessing the principles of quantum physics in wireless communications. More particularly, we conceive a pair of Rydberg atomic receivers: one incorporates a local oscillator (LO) and is referred to as an LO-dressed receiver, while the other operates without an LO and is termed an LO-free receiver. The appropriate wireless model is developed for each configuration, elaborating on the receiver's responses to the radio frequency (RF) signal, on the potential noise sources, and on the signal-to-noise ratio (SNR) performance. The developed wireless model conforms to the classical RF framework, facilitating compatibility with established signal processing methodologies. Next, we investigate the associated distortion effects that might occur, specifically identifying the conditions under which distortion arises and demonstrating the boundaries of linear dynamic ranges. This provides critical insights into its practical implementations in wireless systems. Finally, extensive simulation results are provided for characterizing the performance of wireless systems, harnessing this pair of Rydberg atomic receivers. Our results demonstrate that LO-dressed systems achieve a significant SNR gain of approximately 40-50 dB over conventional RF receivers in the standard quantum limit regime. This SNR headroom translates into reduced symbol error rates, enabling efficient and reliable transmission with higher-order constellations.
中文摘要:本研究提出将里德堡原子接收器集成到无线通信系统中,通过仿真证明采用本地振荡器的接收器比传统系统实现40-50分贝的信噪比提升,能以更低误码率支持高阶星座图传输。
English Summary: This study proposes integrating Rydberg atomic receivers into wireless systems, demonstrating through simulations that LO-dressed receivers achieve 40-50 dB SNR improvement over conventional systems, enabling higher-order constellation transmissions with reduced error rates.

Authors:Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar
Title: Towards Human-Guided, Data-Centric LLM Co-Pilots
Abstract:
Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.
中文:机器学习应用常因领域专家需求与技术实现脱节而受阻,尤其在数据层面,CliMB-DC通过结合多智能体推理与先进工具的人类引导框架,有效解决数据挑战,将原始数据转化为可直接用于机器学习的形式。
English: Machine learning adoption is hindered by a disconnect between domain experts' needs and technical implementation, particularly in data-centric challenges, which CliMB-DC addresses through a human-guided framework combining multi-agent reasoning with advanced tools to transform raw data into ML-ready formats.

Authors:Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yaochen Hu, Lingfeng Zhang, Yingxue Zhang, Shuang Wu, Tongtong Cao, Guowei Huang, Helong Huang, Guangjian Tian, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang
Title: SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning
Abstract:
Spatial reasoning is an essential problem in embodied AI research. Efforts to enhance spatial reasoning abilities through supplementary spatial data and fine-tuning have proven limited and ineffective when addressing complex embodied tasks, largely due to their dependence on language-based outputs. While some approaches have introduced a point-based action space to mitigate this issue, they fall short in managing more intricate tasks within complex environments. This deficiency arises from their failure to fully exploit the inherent thinking and reasoning capabilities that are fundamental strengths of Vision-Language Models (VLMs). To address these limitations, we propose a novel approach named SpatialCoT, specifically designed to bolster the spatial reasoning capabilities of VLMs. Our approach comprises two stages: spatial coordinate bi-directional alignment, which aligns vision-language inputs with spatial coordinates, and chain-of-thought spatial grounding, which harnesses the reasoning capabilities of language models for advanced spatial reasoning. We evaluate SpatialCoT on challenging navigation and manipulation tasks, both in simulation and real-world settings. Experimental results demonstrate that our method significantly outperforms previous state-of-the-art approaches in both tasks.
中文: SpatialCoT通过坐标双向对齐和思维链空间定位增强视觉语言模型的空间推理能力,在复杂导航和操作任务中显著优于现有方法。
English: SpatialCoT enhances Vision-Language Models' spatial reasoning through coordinate alignment and chain-of-thought grounding, achieving superior performance in complex navigation and manipulation tasks.

Authors:Marc Cheong, Sankwi Abuzo, Hideaki Hata, Priscilla Kevin, Winifred Kula, Benson Mirou, Christoph Treude, Dong Wang, Raula Gaikovina Kula
Title: Building Bridges across Papua New Guinea's Digital Divide in Growing the ICT Industry
Abstract:
Papua New Guinea (PNG) is an emerging tech society with an opportunity to overcome geographic and social boundaries, in order to engage with the global market. However, the current tech landscape, dominated by Big Tech in Silicon Valley and other multinational companies in the Global North, tends to overlook the requirements of emerging economies such as PNG. This is becoming more obvious as issues such as algorithmic bias (in tech product deployments) and the digital divide (as in the case of non-affordable commercial software) are affecting PNG users. The Open Source Software (OSS) movement, based on extant research, is seen as a way to level the playing field in the digitalization and adoption of Information and Communications Technologies (ICTs) in PNG. This perspectives paper documents the outcome of the second International Workshop on BRIdging the Divides with Globally Engineered Software (BRIDGES2023) in the hopes of proposing ideas for future research into ICT education, uplifting software engineering (SE) capability, and OSS adoption in promoting a more equitable digital future for PNG.
中文: 巴布亚新几内亚可通过开源软件解决数字不平等问题并促进包容性技术发展,这是BRIDGES2023研讨会成果所强调的。
English: Papua New Guinea can leverage Open Source Software to address digital inequities and foster inclusive technological growth, as highlighted in the BRIDGES2023 workshop findings.

Authors:Moises Diaz, Miguel A. Ferrer, Jose J. Quintana
Title: Anthropomorphic Features for On-Line Signatures
Abstract:
Many features have been proposed in on-line signature verification. Generally, these features rely on the position of the on-line signature samples and their dynamic properties, as recorded by a tablet. This paper proposes a novel feature space to efficiently describe on-line signatures. Since producing a signature requires a skeletal arm system and its associated muscles, the new feature space is based on characterizing the movement of the shoulder, the elbow and the wrist joints when signing. As this motion is not directly obtained from a digital tablet, the new features are calculated by means of a virtual skeletal arm (VSA) model, which simulates the architecture of a real arm and forearm. Specifically, the VSA motion is described by its 3D joint position and its joint angles. These anthropomorphic features are worked out from both pen position and orientation through the VSA forward and direct kinematic model. The anthropomorphic features' robustness is proved by achieving state-of-the-art performance with several verifiers and multiple benchmarks on third-party signature databases, which were collected with different devices and in different languages and scripts.
中文摘要:本文提出一种基于虚拟骨骼臂模型模拟人体手臂关节运动的新型在线签名特征空间,在多种设备和语言的第三方签名数据库上实现了最优验证性能。
English Summary: This paper introduces a novel feature space for online signature verification based on simulating human arm joint movements through a virtual skeletal arm model, achieving state-of-the-art performance across diverse databases and devices.

Authors:Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool
Title: Enhanced Multi-Scale Cross-Attention for Person Image Generation
Abstract:
In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
中文: 本文提出XingGAN,一种基于交叉注意力的生成对抗网络,通过创新的注意力机制相互增强人体形状和外观特征,在保持与扩散模型相当质量的同时大幅提升速度,显著优于现有GAN方法。
English: This paper introduces XingGAN, a novel cross-attention-based generative adversarial network that effectively generates person images by mutually enhancing shape and appearance features through innovative attention mechanisms, outperforming existing GAN methods and matching diffusion models' quality with significantly faster speed.

Authors:Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
Title: MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Abstract:
Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability to detect specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text. Our dataset is available at https://mmdocrag.github.io/MMDocIR/.
中文: 本研究提出了MMDocIR这一综合性基准,用于评估多模态文档检索,包含页面级和布局级任务,结合专家标注和自举数据集,有效提升了系统性能评估与训练效果。
English: This work introduces MMDocIR, a comprehensive benchmark for multimodal document retrieval that includes page-level and layout-level tasks, featuring expert-annotated and bootstrapped datasets to enhance system performance evaluation and training.

Authors:Tobias Rohe, Florian Burger, Michael Kölle, Sebastian Wölckert, Maximilian Zorn, Claudia Linnhoff-Popien
Title: Investigating Parameter-Efficiency of Hybrid QuGANs Based on Geometric Properties of Generated Sea Route Graphs
Abstract:
The demand for artificially generated data for the development, training and testing of new algorithms is omnipresent. Quantum computing (QC) does offer the hope that its inherent probabilistic functionality can be utilised in this field of generative artificial intelligence. In this study, we use quantum-classical hybrid generative adversarial networks (QuGANs) to artificially generate graphs of shipping routes. We create a training dataset based on real shipping data and investigate to what extent QuGANs are able to learn and reproduce inherent distributions and geometric features of this data. We compare hybrid QuGANs with classical Generative Adversarial Networks (GANs), with a special focus on their parameter efficiency. Our results indicate that QuGANs are indeed able to quickly learn and represent underlying geometric properties and distributions, although they seem to have difficulties in introducing variance into the sampled data. Compared to classical GANs of greater size, measured in the number of parameters used, some QuGANs show similar result quality. Our reference to concrete use cases, such as the generation of shipping data, provides an illustrative example and demonstrates the potential and diversity in which QC can be used.
中文: 研究表明,量子-经典混合生成对抗网络(QuGANs)能有效学习并复制真实航运路线数据的几何特征和分布,在与参数更多的经典GANs性能相当的同时,也显示出在数据多样性生成方面的局限性。
English: This study demonstrates that quantum-classical hybrid generative adversarial networks (QuGANs) can effectively learn and replicate the geometric properties and distributions of real shipping route data, showing comparable performance to larger classical GANs while highlighting challenges in data variance generation.

Authors:Runqi Wang, Yang Chen, Sijie Xu, Tianyao He, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, Yao Hu
Title: DynamicFace: High-Quality and Consistent Face Swapping for Image and Video using Composable 3D Facial Priors
Abstract:
Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion models and plug-and-play adaptive attention layers for image and video face swapping. First, we introduce four fine-grained facial conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Our framework seamlessly adapts to both image and video domains. Our code and results will be available on the project page: https://dynamic-face.github.io/
中文:所提出的DynamicFace方法利用扩散模型和自适应注意力层,通过解耦面部条件并注入身份信息实现精确换脸,在图像和视频领域均展现出最先进的性能表现。
English: The proposed DynamicFace method utilizes diffusion models and adaptive attention layers to achieve precise face swapping by disentangling facial conditions and injecting identity information, demonstrating state-of-the-art performance in both image and video domains.

Authors:Runxin Han, Bo Yang, Zhiwen Yu, Xuelin Cao, George C. Alexandropoulos, Chau Yuen
Title: Multi-task Domain Adaptation for Computation Offloading in Edge-intelligence Networks
Abstract:
In the field of multi-access edge computing (MEC), efficient computation offloading is crucial for improving resource utilization and reducing latency in dynamically changing environments. This paper introduces a new approach, termed Multi-Task Domain Adaptation (MTDA), aiming to enhance the ability of computational offloading models to generalize in the presence of domain shifts, i.e., when new data in the target environment significantly differs from the data in the source domain. The proposed MTDA model incorporates a teacher-student architecture that allows continuous adaptation without necessitating access to the source domain data during inference, thereby maintaining privacy and reducing computational overhead. Utilizing a multi-task learning framework that simultaneously manages offloading decisions and resource allocation, the proposed MTDA approach outperforms benchmark methods regarding mean squared error and accuracy, particularly in environments with increasing numbers of users. Computer simulations show that the proposed MTDA model maintains high performance across various scenarios, demonstrating its potential for practical deployment in emerging MEC applications.
中文: 本文提出的多任务域自适应(MTDA)模型通过教师-学生架构和多任务学习框架,在多接入边缘计算中实现了跨域环境下的高效计算卸载,无需源域数据即可保持优越性能。
English: This paper introduces a Multi-Task Domain Adaptation (MTDA) model that enhances computation offloading in multi-access edge computing by maintaining high performance across domain shifts without requiring source domain data during inference.

Authors:Yangsibo Huang, Milad Nasr, Anastasios Angelopoulos, Nicholas Carlini, Wei-Lin Chiang, Christopher A. Choquette-Choo, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Ken Ziyu Liu, Ion Stoica, Florian Tramer, Chiyuan Zhang
Title: Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Abstract:
It is now common to evaluate Large Language Models (LLMs) by having humans manually vote to evaluate model outputs, in contrast to typical benchmarks that evaluate knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than 95% accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increase the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to strengthen security in Chatbot Arena.
中文摘要:本文揭示,若缺乏机器人防护,基于投票的大型语言模型评估基准(如Chatbot Arena)易受对抗性操纵,攻击者可通过识别模型来源并策略性投票以少量成本篡改排行榜,促使平台集成reCAPTCHA和登录等强化防御措施。
English Summary: This paper reveals that voting-based benchmarks for evaluating Large Language Models, such as Chatbot Arena, are vulnerable to adversarial manipulation if bot protection is absent, allowing attackers to alter leaderboards with minimal votes by identifying model sources and voting strategically, prompting the implementation of enhanced defenses like reCAPTCHA and login systems.
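
To make the second step of the described attack concrete, the toy sketch below shows how repeated votes for one model move a pairwise rating. It assumes a plain online Elo update with a hypothetical step size `k=4.0` and invented model names; the rating model actually used by the leaderboard may differ, the identification step is not modeled, and the vote counts here are not the paper's figures.

```python
def expected_score(r_a, r_b):
    """Elo-model probability that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def apply_vote(ratings, winner, loser, k=4.0):
    """Update both ratings in place after one pairwise vote."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain

def simulate_targeted_votes(ratings, target, n_votes, k=4.0):
    """Cast n_votes for `target`, cycling through the other models as opponents.

    Assumes the voter always knows which side is the target (hypothetical
    perfect identification). Returns a new dict; the input is not modified.
    """
    ratings = dict(ratings)
    others = [m for m in ratings if m != target]
    for i in range(n_votes):
        apply_vote(ratings, target, others[i % len(others)], k)
    return ratings

# Three hypothetical models starting from equal ratings.
start = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
after = simulate_targeted_votes(start, "model_c", n_votes=50)
```

Because each vote transfers rating points from the loser to the winner, the total rating mass is conserved while the target climbs. This is also why defenses that raise the cost per vote, such as rate limiting or login, directly raise the cost of shifting a ranking.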

Authors:Zhengyan Sheng, Zhihao Du, Heng Lu, Shiliang Zhang, Zhen-Hua Ling
Title: Unispeaker: A Unified Approach for Multimodality-driven Speaker Generation
Abstract:
Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation remains on the rise. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former, applying soft contrastive loss to map diverse voice description modalities into a shared voice space, ensuring that the generated voice aligns more closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, focusing on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models. Speech samples are available at https://UniSpeaker.github.io.
中文: 本文提出UniSpeaker统一多模态说话人生成系统,通过基于KV-Former的语音聚合器和软对比损失将多模态语音描述映射到共享空间,在新建的多模态语音控制基准测试的五个任务中均优于现有模型。
English: This paper presents UniSpeaker, a unified multimodal speaker generation system that employs a KV-Former-based voice aggregator with soft contrastive loss to map diverse voice descriptions into a shared space, outperforming previous models across five tasks on the newly introduced Multimodality-based Voice Control benchmark.

Authors:Vighnesh Subramaniam, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Shuang Li, Igor Mordatch
Title: Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Abstract:
Large language models (LLMs) have achieved remarkable performance in recent years but are fundamentally limited by the underlying training data. To improve models beyond the training data, recent works have explored how LLMs can be used to generate synthetic data for autonomous self-improvement. However, successive steps of self-improvement can reach a point of diminishing returns. In this work, we propose a complementary approach towards self-improvement where finetuning is applied to a multiagent society of language models. A group of language models, all starting from the same base model, are independently specialized by updating each one using data generated through multiagent interactions among the models. By training each model on independent sets of data, we illustrate how this approach enables specialization across models and diversification over the set of models. As a result, our overall system is able to preserve diverse reasoning chains and autonomously improve over many more rounds of fine-tuning than single-agent self-improvement methods. We quantitatively illustrate the efficacy of the approach across a wide suite of reasoning tasks.
中文: 本研究提出一种多智能体语言模型社会,通过模型间交互生成的数据进行独立专业化,实现了超越单智能体方法的持续自我改进和多样化推理能力。
English: This study introduces a multiagent society of language models that undergo independent specialization through data generated from their interactions, enabling sustained self-improvement and diverse reasoning beyond the limitations of single-agent methods.

Authors:David Noever, Forrest McKee
Title: Infecting Generative AI With Viruses
Abstract:
This study demonstrates a novel approach to testing the security boundaries of Vision-Large Language Models (VLM/LLM) using the EICAR test file embedded within JPEG images. We successfully executed four distinct protocols across multiple LLM platforms, including OpenAI GPT-4o, Microsoft Copilot, Google Gemini 1.5 Pro, and Anthropic Claude 3.5 Sonnet. The experiments validated that a modified JPEG containing the EICAR signature could be uploaded, manipulated, and potentially executed within LLM virtual workspaces. Key findings include: 1) consistent ability to mask the EICAR string in image metadata without detection, 2) successful extraction of the test file using Python-based manipulation within LLM environments, and 3) demonstration of multiple obfuscation techniques including base64 encoding and string reversal. This research extends Microsoft Research's "Penetration Testing Rules of Engagement" framework to evaluate cloud-based generative AI and LLM security boundaries, particularly focusing on file handling and execution capabilities within containerized environments.
中文摘要:本研究通过将EICAR测试文件嵌入JPEG图像,开发出测试视觉大语言模型安全性的新方法,在多个LLM平台成功实现文件操作与执行,并能有效规避安全检测。
English Summary: This study introduces a novel method for testing Vision-Large Language Model security by embedding the EICAR test file in JPEG images, demonstrating that the file can be uploaded, manipulated, and potentially executed across multiple LLM platforms while its signature goes undetected in image metadata.

Authors:Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist, Luke Ong, Philip Torr, Kwok-Yan Lam, Robert Trager, David Krueger, Sören Mindermann, José Hernandez-Orallo, Mor Geva, Yarin Gal
Title: Open Problems in Machine Unlearning for AI Safety
Abstract:
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes -- unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.
Chinese: 尽管机器遗忘通过选择性消除有害知识在AI安全领域展现出潜力,但在敏感的双重用途领域中存在显著局限,因为这种消除可能损害有益应用并与现有安全机制产生冲突。
English: While machine unlearning offers potential for AI safety by selectively removing harmful knowledge, it faces significant limitations in sensitive dual-use domains where such removal could impair beneficial applications and conflict with existing safety mechanisms.

Authors:Ching-Chun Chang, Yijie Lin, Isao Echizen
Title: Cyber-Physical Steganography in Robotic Motion Control
Abstract:
Steganography, the art of information hiding, has continually evolved across visual, auditory and linguistic domains, adapting to the ceaseless interplay between steganographic concealment and steganalytic revelation. This study seeks to extend the horizons of what constitutes a viable steganographic medium by introducing a steganographic paradigm in robotic motion control. Based on the observation of the robot's inherent sensitivity to changes in its environment, we propose a methodology to encode messages as environmental stimuli influencing the motions of the robotic agent and to decode messages from the resulting motion trajectory. The constraints of maximal robot integrity and minimal motion deviation are established as fundamental principles underlying secrecy. As a proof of concept, we conduct experiments in simulated environments across various manipulation tasks, incorporating robotic embodiments equipped with generalist multimodal policies.
中文摘要:本研究提出了一种新颖的隐写方法,通过将秘密信息编码为环境刺激来微妙改变机器人运动轨迹,以保持机器人完整性和最小运动偏差为核心保密原则,并在模拟操作任务中验证了该方法的可行性。
English Summary: This study introduces a novel steganographic method that encodes secret messages as environmental stimuli to subtly alter robotic motions, establishing maximal robot integrity and minimal motion deviation as core secrecy principles while validating the approach through simulated manipulation tasks.

Authors:Minglin Chen, Longguang Wang, Sheng Ao, Ye Zhang, Kai Xu, Yulan Guo
Title: Layout2Scene: 3D Semantic Layout Guided Scene Generation via Geometry and Appearance Diffusion Priors
Abstract:
3D scene generation conditioned on text prompts has significantly progressed due to the development of 2D diffusion generation models. However, the textual description of 3D scenes is inherently inaccurate and lacks fine-grained control during training, leading to implausible scene generation. As an intuitive and feasible solution, the 3D layout allows for precise specification of object locations within the scene. To this end, we present a text-to-scene generation method (namely, Layout2Scene) using additional semantic layout as the prompt to inject precise control of 3D object positions. Specifically, we first introduce a scene hybrid representation to decouple objects and backgrounds, which is initialized via a pre-trained text-to-3D model. Then, we propose a two-stage scheme to optimize the geometry and appearance of the initialized scene separately. To fully leverage 2D diffusion priors in geometry and appearance generation, we introduce a semantic-guided geometry diffusion model and a semantic-geometry guided diffusion model which are finetuned on a scene dataset. Extensive experiments demonstrate that our method can generate more plausible and realistic scenes as compared to state-of-the-art approaches. Furthermore, the generated scene allows for flexible yet precise editing, thereby facilitating multiple downstream applications.
中文摘要:Layout2Scene方法通过引入语义布局作为提示,结合两阶段优化方案和扩散模型,实现了比现有技术更逼真且可灵活编辑的3D场景生成。
English Summary: The Layout2Scene method enhances 3D scene generation by incorporating semantic layouts for precise object positioning, utilizing a two-stage optimization process and diffusion models to produce more realistic and editable scenes than current techniques.

Authors:Youcheng Huang, Chen Huang, Duanyu Feng, Wenqiang Lei, Jiancheng Lv
Title: Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts
Abstract:
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM's concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato's Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs.
Chinese: 本研究引入了一种线性变换方法,用于在不同大语言模型间对齐概念表征,实现了跨模型行为控制,并揭示了从小模型提取的引导向量能有效指导大模型行为的弱到强可迁移性。
English: This research introduces a linear transformation method to align concept representations across different Large Language Models, enabling cross-model behavioral control and revealing a weak-to-strong transferability where steering vectors from smaller models can effectively guide larger ones.

Authors:Gaurav Parmar, Or Patashnik, Kuan-Chieh Wang, Daniil Ostashev, Srinivasa Narasimhan, Jun-Yan Zhu, Daniel Cohen-Or, Kfir Aberman
Title: Object-level Visual Prompts for Compositional Image Generation
Abstract:
We introduce a method for composing object-level visual prompts within a text-to-image diffusion model. Our approach addresses the task of generating semantically coherent compositions across diverse scenes and styles, similar to the versatility and expressiveness offered by text prompts. A key challenge in this task is to preserve the identity of the objects depicted in the input visual prompts, while also generating diverse compositions across different images. To address this challenge, we introduce a new KV-mixed cross-attention mechanism, in which keys and values are learned from distinct visual representations. The keys are derived from an encoder with a small bottleneck for layout control, whereas the values come from a larger bottleneck encoder that captures fine-grained appearance details. By mixing keys and values from these complementary sources, our model preserves the identity of the visual prompts while supporting flexible variations in object arrangement, pose, and composition. During inference, we further propose object-level compositional guidance to improve the method's identity preservation and layout correctness. Results show that our technique produces diverse scene compositions that preserve the unique characteristics of each visual prompt, expanding the creative potential of text-to-image generation.
中文: 本文提出一种KV混合交叉注意力机制,通过分别从布局编码器和外观编码器学习键值对,使文本到图像扩散模型能够在使用物体级视觉提示时保持对象身份,同时生成多样化的场景构图。
English: This paper introduces a KV-mixed cross-attention mechanism for text-to-image diffusion models, enabling object-level visual prompts to maintain identity while generating diverse scene compositions through separate key and value learning from layout and appearance encoders.
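
The key/value split described in this abstract can be illustrated with a small sketch. It assumes plain scaled dot-product attention in which the keys come from one (layout) representation and the values from another (appearance) representation; the function name `kv_mixed_attention` and the toy inputs are hypothetical, and the paper's learned encoders and diffusion backbone are not modelled.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def kv_mixed_attention(queries, layout_keys, appearance_values):
    """Scaled dot-product attention whose keys and values come from different sources.

    Attention weights depend only on the layout keys; the content that is
    mixed into the output comes only from the appearance values.
    """
    d = len(layout_keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in layout_keys]
        weights = softmax(scores)
        width = len(appearance_values[0])
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, appearance_values))
            for j in range(width)
        ])
    return outputs

# Toy example: two image queries, two visual-prompt tokens.
queries = [[10.0, 0.0], [0.0, 10.0]]
layout_keys = [[1.0, 0.0], [0.0, 1.0]]                  # small-bottleneck features (assumed)
appearance_values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # larger-bottleneck features (assumed)
mixed = kv_mixed_attention(queries, layout_keys, appearance_values)
```

Because the keys and values are separate arguments, they may have different widths, which is what allows a coarse encoder to decide where a prompt goes while a richer encoder decides what it looks like.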

Authors:Or Patashnik, Rinon Gal, Daniil Ostashev, Sergey Tulyakov, Kfir Aberman, Daniel Cohen-Or
Title: Nested Attention: Semantic-aware Attention Values for Concept Personalization
Abstract:
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model's prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
Chinese Summary: 嵌套注意力是一种创新机制,通过将丰富的图像表征注入交叉注意力层,在保持身份特征与提示对齐之间实现更优平衡,同时支持多主体图像合成。
English Summary: Nested Attention is a novel mechanism that enhances text-to-image personalization by injecting rich image representations into cross-attention layers, achieving superior balance between identity preservation and prompt alignment while enabling multi-subject composition.

Authors:Kaushik Roy, Harshul Surana, Darssan Eswaramoorthi, Yuxin Zi, Vedant Palit, Ritvik Garimella, Amit Sheth
Title: Large Language Models for Mental Health Diagnostic Assessments: Exploring The Potential of Large Language Models for Assisting with Mental Health Diagnostic Assessments -- The Depression and Anxiety Case
Abstract:
Large language models (LLMs) are increasingly attracting the attention of healthcare professionals for their potential to assist in diagnostic assessments, which could alleviate the strain on the healthcare system caused by a high patient load and a shortage of providers. For LLMs to be effective in supporting diagnostic assessments, it is essential that they closely replicate the standard diagnostic procedures used by clinicians. In this paper, we specifically examine the diagnostic assessment processes described in the Patient Health Questionnaire-9 (PHQ-9) for major depressive disorder (MDD) and the Generalized Anxiety Disorder-7 (GAD-7) questionnaire for generalized anxiety disorder (GAD). We investigate various prompting and fine-tuning techniques to guide both proprietary and open-source LLMs in adhering to these processes, and we evaluate the agreement between LLM-generated diagnostic outcomes and expert-validated ground truth. For fine-tuning, we utilize the Mentalllama and Llama models, while for prompting, we experiment with proprietary models like GPT-3.5 and GPT-4o, as well as open-source models such as llama-3.1-8b and mixtral-8x7b.
中文: 大型语言模型正被研究用于通过模拟临床程序来支持医疗诊断评估,本研究重点探讨其通过提示和微调技术与PHQ-9和GAD-7诊断标准的一致性。
English: Large language models are being explored to support diagnostic assessments in healthcare by replicating clinical procedures, with this study focusing on their alignment with PHQ-9 and GAD-7 protocols through prompting and fine-tuning techniques.

Authors:Mengjie Qin, Yuchao Feng, Zongliang Wu, Yulun Zhang, Xin Yuan
Title: Detail Matters: Mamba-Inspired Joint Unfolding Network for Snapshot Spectral Compressive Imaging
Abstract:
In the coded aperture snapshot spectral imaging system, Deep Unfolding Networks (DUNs) have made impressive progress in recovering 3D hyperspectral images (HSIs) from a single 2D measurement. However, the inherent nonlinear and ill-posed characteristics of HSI reconstruction still pose challenges to existing methods in terms of accuracy and stability. To address this issue, we propose a Mamba-inspired Joint Unfolding Network (MiJUN), which integrates physics-embedded DUNs with learning-based HSI imaging. Firstly, leveraging the concept of trapezoid discretization to expand the representation space of unfolding networks, we introduce an accelerated unfolding network scheme. This approach can be interpreted as a generalized accelerated half-quadratic splitting with a second-order differential equation, which reduces the reliance on initial optimization stages and addresses challenges related to long-range interactions. Crucially, within the Mamba framework, we restructure the Mamba-inspired global-to-local attention mechanism by incorporating a selective state space model and an attention mechanism. This effectively reinterprets Mamba as a variant of the Transformer architecture, improving its adaptability and efficiency. Furthermore, we refine the scanning strategy with Mamba by integrating the tensor mode-$k$ unfolding into the Mamba network. This approach emphasizes the low-rank properties of tensors along various modes, while conveniently facilitating 12 scanning directions. Numerical and visual comparisons on both simulation and real datasets demonstrate the superiority of our proposed MiJUN and its strong detail representation.
中文: 提出的Mamba启发联合展开网络(MiJUN)通过将物理驱动的深度展开网络与基于学习的高光谱成像相结合,借助创新的扫描策略和注意力机制,实现了在精度和细节还原方面更优越的高光谱图像重建效果。
English: The proposed Mamba-inspired Joint Unfolding Network (MiJUN) enhances hyperspectral image reconstruction by integrating physics-based deep unfolding networks with learning-based imaging, achieving superior accuracy and detail representation through innovative scanning strategies and attention mechanisms.

Authors:Ali Behrouz, Peilin Zhong, Vahab Mirrokni
Title: Titans: Learning to Memorize at Test Time
Abstract:
Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.
中文: 研究人员提出了Titans新架构,结合注意力机制的短期记忆与神经长期记忆模块,能在多种任务中高效处理超长上下文,实现更高准确性和扩展性。
English: Researchers have developed Titans, a new neural architecture combining attention for short-term memory and a neural long-term memory module to efficiently handle extensive context with improved accuracy and scalability across various tasks.

Authors:Chengbo He, Bochao Zou, Xin Li, Jiansheng Chen, Junliang Xing, Huimin Ma
Title: Enhancing LLM Reasoning with Multi-Path Collaborative Reactive and Reflection agents
Abstract:
Agents have demonstrated their potential in scientific reasoning tasks through large language models. However, they often face challenges such as insufficient accuracy and degeneration of thought when handling complex reasoning tasks, which impede their performance. To overcome these issues, we propose the Reactive and Reflection agents with Multi-Path Reasoning (RR-MP) Framework, aimed at enhancing the reasoning capabilities of LLMs. Our approach improves scientific reasoning accuracy by employing a multi-path reasoning mechanism where each path consists of a reactive agent and a reflection agent that collaborate to prevent degeneration of thought inherent in single-agent reliance. Additionally, the RR-MP framework does not require additional training; it utilizes multiple dialogue instances for each reasoning path and a separate summarizer to consolidate insights from all paths. This design integrates diverse perspectives and strengthens reasoning across each path. We conducted zero-shot and few-shot evaluations on tasks involving moral scenarios, college-level physics, and mathematics. Experimental results demonstrate that our method outperforms baseline approaches, highlighting the effectiveness and advantages of the RR-MP framework in managing complex scientific reasoning tasks.
中文: RR-MP框架通过反应与反思代理的多路径协作机制,无需额外训练即可提升大语言模型的科学推理准确性,在复杂任务中表现优于基线方法。
English: The RR-MP framework enhances scientific reasoning in LLMs through multi-path collaboration between reactive and reflection agents, improving accuracy without additional training and outperforming baselines in complex tasks.

Authors:Pat Pataranutaporn, Nattavudh Powdthavee, Pattie Maes
Title: Algorithmic Inheritance: Surname Bias in AI Decisions Reinforces Intergenerational Inequality
Abstract:
Surnames often convey implicit markers of social status, wealth, and lineage, shaping perceptions in ways that can perpetuate systemic biases and intergenerational inequality. This study is the first of its kind to investigate whether and how surnames influence AI-driven decision-making, focusing on their effects across key areas such as hiring recommendations, leadership appointments, and loan approvals. Using 72,000 evaluations of 600 surnames from the United States and Thailand, two countries with distinct sociohistorical contexts and surname conventions, we classify names into four categories: Rich, Legacy, Normal, and phonetically similar Variant groups. Our findings show that elite surnames consistently increase AI-generated perceptions of power, intelligence, and wealth, which in turn influence AI-driven decisions in high-stakes contexts. Mediation analysis reveals perceived intelligence as a key mechanism through which surname biases influence the AI decision-making process. While providing objective qualifications alongside surnames mitigates most of these biases, it does not eliminate them entirely, especially in contexts where candidate credentials are low. These findings highlight the need for fairness-aware algorithms and robust policy measures to prevent AI systems from reinforcing systemic inequalities tied to surnames, an often-overlooked bias compared to more salient characteristics such as race and gender. Our work calls for a critical reassessment of algorithmic accountability and its broader societal impact, particularly in systems designed to uphold meritocratic principles while counteracting the perpetuation of intergenerational privilege.
中文摘要:本研究发现精英姓氏通过提升对智能和权力的感知,系统性地影响人工智能在招聘和贷款等领域的决策,亟需公平算法来消除这种常被忽视的姓氏偏见。
English Summary: This study reveals that elite surnames systematically bias AI decision-making in areas like hiring and loans by enhancing perceptions of intelligence and power, necessitating fairness-aware algorithms to counteract this overlooked form of inequality.

Authors:Xutong Liu, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, Carlee Joe-Wong, John C. S. Lui, Wei Chen
Title: Offline Learning for Combinatorial Multi-armed Bandits
Abstract:
The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial costs of online interactions and the readily available offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
中文: 本研究提出了首个组合多臂老虎机离线学习框架Off-CMAB,通过悲观估计算法在新提出的数据覆盖条件下实现近乎最优的性能,并在排序和大语言模型缓存等应用中验证了其有效性。
English: The study introduces Off-CMAB, the first offline learning framework for combinatorial multi-armed bandits, which uses a pessimistic algorithm to achieve near-optimal performance under novel data coverage conditions and demonstrates effectiveness in applications like ranking and LLM caching.
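
The idea of combining pessimistic estimates with a combinatorial solver can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's CLCB algorithm: rewards are assumed to lie in [0, 1], the confidence bonus is a Hoeffding-style term chosen for the example, and a simple top-k selection stands in for the combinatorial solver. All function names are hypothetical.

```python
import math

def lower_confidence_bounds(rewards_by_arm, delta=0.05):
    """Pessimistic per-arm estimates: empirical mean minus a Hoeffding-style bonus.

    The bonus form is an assumption made for this sketch.
    """
    n_arms = len(rewards_by_arm)
    lcbs = []
    for samples in rewards_by_arm:
        n = len(samples)
        if n == 0:
            lcbs.append(0.0)  # never observed: fall back to the lowest possible reward
            continue
        mean = sum(samples) / n
        bonus = math.sqrt(math.log(2 * n_arms / delta) / (2 * n))
        lcbs.append(max(0.0, mean - bonus))
    return lcbs

def top_k_oracle(scores, k):
    """Stand-in combinatorial solver: pick the k arms with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical offline dataset: one list of logged rewards per base arm.
offline_data = [
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1] * 10,  # arm 0: 100 samples, mean 0.8
    [1],                                   # arm 1: 1 sample, mean 1.0 (poorly covered)
    [0, 1, 0, 0, 1] * 20,                  # arm 2: 100 samples, mean 0.4
    [1, 0, 1, 1, 0] * 20,                  # arm 3: 100 samples, mean 0.6
]
lcbs = lower_confidence_bounds(offline_data)
chosen = top_k_oracle(lcbs, k=2)
```

In this toy dataset arm 1 has the highest empirical mean but only a single sample, so its lower confidence bound collapses to zero and the well-covered arms 0 and 3 are selected instead, which is the behaviour pessimism is meant to produce.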

Authors:Stephen Meisenbacher, Maulik Chevli, Florian Matthes
Title: On the Impact of Noise in Differentially Private Text Rewriting
Abstract:
The field of text privatization often leverages the notion of $\textit{Differential Privacy}$ (DP) to provide formal guarantees in the rewriting or obfuscation of sensitive textual data. A common and nearly ubiquitous form of DP application necessitates the addition of calibrated noise to vector representations of text, either at the data- or model-level, which is governed by the privacy parameter $\varepsilon$. However, noise addition almost undoubtedly leads to considerable utility loss, thereby highlighting one major drawback of DP in NLP. In this work, we introduce a new sentence infilling privatization technique, and we use this method to explore the effect of noise in DP text rewriting. We empirically demonstrate that non-DP privatization techniques excel in utility preservation and can find an acceptable empirical privacy-utility trade-off, yet cannot outperform DP methods in empirical privacy protections. Our results highlight the significant impact of noise in current DP rewriting mechanisms, leading to a discussion of the merits and challenges of DP in NLP, as well as the opportunities that non-DP methods present.
中文摘要:本研究提出了一种句子填充技术来探讨噪声在文本重写中对差分隐私的影响,发现非差分隐私方法虽能更好地保持实用性,但在隐私保护上不及差分隐私方法,突显了自然语言处理中隐私与效用的权衡挑战。
English Summary: This study introduces a sentence infilling technique to examine how noise affects differential privacy in text rewriting, revealing that while non-DP methods preserve utility better, they fall short of DP's privacy protections, underscoring the trade-offs in NLP applications.
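
To make the role of $\varepsilon$ concrete, the sketch below applies the standard Laplace mechanism to a clipped vector representation. It is an illustration of calibrated noise addition in general, not of the paper's infilling technique; the function names, the per-coordinate clipping bound, and the 64-dimensional toy embedding are assumptions made for the example.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize_embedding(vec, epsilon, clip=1.0, rng=None):
    """Laplace mechanism on a vector whose coordinates are clipped to [-clip, clip].

    Two clipped vectors differ by at most 2 * clip per coordinate, so the
    L1 sensitivity is 2 * clip * len(vec) and the noise scale is sensitivity / epsilon.
    """
    rng = rng or random.Random()
    clipped = [max(-clip, min(clip, x)) for x in vec]
    sensitivity = 2.0 * clip * len(vec)
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in clipped]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

embedding = [0.5, -0.25, 0.1, 0.9] * 16  # hypothetical 64-dimensional text embedding
strict = privatize_embedding(embedding, epsilon=0.1, rng=random.Random(0))
loose = privatize_embedding(embedding, epsilon=10.0, rng=random.Random(0))
strict_error = mean_abs_error(strict, embedding)
loose_error = mean_abs_error(loose, embedding)
```

The noise scale grows with the dimension and shrinks with $\varepsilon$, so a small privacy budget on a high-dimensional vector yields noise far larger than the signal, which is the utility loss the abstract refers to.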

Authors:Khai Nguyen, Hai Nguyen, Tuan Pham, Nhat Ho
Title: Lightspeed Geometric Dataset Distance via Sliced Optimal Transport
Abstract:
We introduce sliced optimal transport dataset distance (s-OTDD), a model-agnostic, embedding-agnostic approach for dataset comparison that requires no training, is robust to variations in the number of classes, and can handle disjoint label sets. The core innovation is Moment Transform Projection (MTP), which maps a label, represented as a distribution over features, to a real number. Using MTP, we derive a data point projection that transforms datasets into one-dimensional distributions. The s-OTDD is defined as the expected Wasserstein distance between the projected distributions, with respect to random projection parameters. Leveraging the closed form solution of one-dimensional optimal transport, s-OTDD achieves (near-)linear computational complexity in the number of data points and feature dimensions and is independent of the number of classes. With its geometrically meaningful projection, s-OTDD strongly correlates with the optimal transport dataset distance while being more efficient than existing dataset discrepancy measures. Moreover, it correlates well with the performance gap in transfer learning and classification accuracy in data augmentation.
中文:s-OTDD是一种无需训练、高效的数据集比较方法,通过一维投影和Wasserstein距离实现,与迁移学习性能和分类准确率密切相关。
English: The s-OTDD is a training-free, efficient method for comparing datasets using one-dimensional projections and Wasserstein distance, which correlates well with transfer learning performance and classification accuracy.
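
The closed-form one-dimensional optimal transport step that the abstract relies on can be sketched as below. This is a plain sliced Wasserstein distance over feature vectors only: it uses random linear projections and does not implement the paper's Moment Transform Projection or any label handling. Equal sample sizes are assumed, and the function names are hypothetical.

```python
import math
import random

def wasserstein_1d(xs, ys):
    """Closed-form 1-Wasserstein distance between two equal-size 1D samples.

    In one dimension the optimal coupling matches sorted values, so the cost
    is dominated by sorting, i.e. O(n log n).
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def random_direction(dim, rng):
    """Draw a direction uniformly from the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sliced_wasserstein(points_a, points_b, n_projections=200, seed=0):
    """Average 1D Wasserstein distance over random projections (Monte Carlo estimate)."""
    rng = random.Random(seed)
    dim = len(points_a[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(dim, rng)
        proj_a = [sum(t * x for t, x in zip(theta, p)) for p in points_a]
        proj_b = [sum(t * x for t, x in zip(theta, p)) for p in points_b]
        total += wasserstein_1d(proj_a, proj_b)
    return total / n_projections

# Toy datasets: B is A shifted slightly, C is A shifted further.
data_rng = random.Random(1)
A = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
B = [[x + 1.0, y + 1.0] for x, y in A]
C = [[x + 3.0, y + 3.0] for x, y in A]
near = sliced_wasserstein(A, B)
far = sliced_wasserstein(A, C)
```

Each projection costs one pass over the data plus a sort, which is where the near-linear complexity in the number of data points comes from.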

Authors:Zag ElSayed, Ahmed Abdelgawad, Nelly Elsayed
Title: CryptoDNA: A Machine Learning Paradigm for DDoS Detection in Healthcare IoT, Inspired by crypto jacking prevention Models
Abstract:
The rapid integration of the Internet of Things (IoT) and Internet of Medical (IoM) devices in the healthcare industry has markedly improved patient care and hospital operations but has concurrently brought substantial risks. Distributed Denial-of-Service (DDoS) attacks present significant dangers, jeopardizing operational stability and patient safety. This study introduces CryptoDNA, an innovative machine learning detection framework influenced by cryptojacking detection methods, designed to identify and alleviate DDoS attacks in healthcare IoT settings. The proposed approach relies on behavioral analytics, including atypical resource usage and network activity patterns. Key features derived from cryptojacking-inspired methodologies include entropy-based analysis of traffic, time-series monitoring of device performance, and dynamic anomaly detection. A lightweight architecture ensures inter-compatibility with resource-constrained IoT devices while maintaining high detection accuracy. The proposed architecture and model were tested in real-world and synthetic datasets to demonstrate the model's superior performance, achieving over 96% accuracy with minimal computational overhead. Comparative analysis reveals its resilience against emerging attack vectors and scalability across diverse device ecosystems. By bridging principles from cryptojacking and DDoS detection, CryptoDNA offers a robust, innovative solution to fortify the healthcare IoT landscape against evolving cyber threats and highlights the potential of interdisciplinary approaches in adaptive cybersecurity defense mechanisms for critical healthcare infrastructures.
中文: 本研究提出CryptoDNA框架,借鉴加密货币劫持检测方法,通过行为分析和动态异常检测,在医疗物联网环境中以超过96%的准确率高效识别DDoS攻击,且计算开销极小。
English: This study introduces CryptoDNA, a machine learning framework inspired by cryptojacking detection methods, which effectively identifies DDoS attacks in healthcare IoT systems with over 96% accuracy and minimal resource usage.
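
One of the features mentioned in the abstract, entropy-based analysis of traffic, can be illustrated with a short sketch. It computes the Shannon entropy of destination addresses in a window of packets and compares it with a baseline window; treating a sharp entropy drop as a sign of traffic concentrating on one target is a common heuristic assumed here, not a detail taken from the paper. The function names and IP addresses are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution of `items`."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_drop(baseline_window, current_window):
    """How many bits of entropy the current window has lost relative to the baseline."""
    return shannon_entropy(baseline_window) - shannon_entropy(current_window)

# Hypothetical destination addresses seen in two windows of 20 packets each.
normal = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"] * 5  # evenly spread traffic
flood = ["10.0.0.1"] * 19 + ["10.0.0.2"]                        # traffic piling onto one device
drop = entropy_drop(normal, flood)
```

A detector built on this feature would compare `drop` against a threshold tuned on real traffic; the feature itself needs only a counter per window, which suits resource-constrained devices.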

Authors:Jie Peng, Shuang Zhou, Longwei Yang, Yiran Song, Mohan Zhang, Kaixiong Zhou, Feng Xie, Mingquan Lin, Rui Zhang, Tianlong Chen
Title: Continually Evolved Multimodal Foundation Models for Cancer Prognosis
Abstract:
Cancer prognosis is a critical task that involves predicting patient outcomes and survival rates. To enhance prediction accuracy, previous studies have integrated diverse data modalities, such as clinical notes, medical images, and genomic data, leveraging their complementary information. However, existing approaches face two major limitations. First, they struggle to incorporate newly arrived data with varying distributions into training, such as patient records from different hospitals, thus rendering sub-optimal generalizability and limited utility in real-world applications. Second, most multimodal integration methods rely on simplistic concatenation or task-specific pipelines, which fail to capture the complex interdependencies across modalities. To address these, we propose a continually evolving multi-modal foundation model. Extensive experiments on the TCGA dataset demonstrate the effectiveness of our approach, highlighting its potential to advance cancer prognosis by enabling robust and adaptive multimodal integration.
中文: 该摘要提出了一种持续演化的多模态基础模型,旨在解决癌症预后中数据分布差异和模态间复杂依赖关系捕获的局限性,并通过TCGA数据集的实验验证了其有效性。
English: This work proposes a continually evolving multi-modal foundation model to overcome limitations in cancer prognosis, such as handling diverse data distributions and capturing complex interdependencies across modalities, demonstrating its effectiveness through TCGA dataset experiments.

Authors:Xinyu Wang, Lei Liu, Kang Chen, Tao Han, Bin Li, Lei Bai
Title: VQLTI: Long-Term Tropical Cyclone Intensity Forecasting with Physical Constraints
Abstract:
Tropical cyclone (TC) intensity forecasting is crucial for early disaster warning and emergency decision-making. Numerous researchers have explored deep-learning methods to address computational and post-processing issues in operational forecasting. Regrettably, these methods exhibit subpar long-term forecasting capabilities. We use two strategies to enhance long-term forecasting. (1) By enhancing the matching between TC intensity and spatial information, we can improve long-term forecasting performance. (2) Incorporating physical knowledge and physical constraints can help mitigate the accumulation of forecasting errors. To implement these strategies, we propose the VQLTI framework. VQLTI transfers the TC intensity information to a discrete latent space while retaining the spatial information differences, using large-scale spatial meteorological data as conditions. Furthermore, we leverage the forecast from the weather prediction model FengWu to provide additional physical knowledge for VQLTI. Additionally, we calculate the potential intensity (PI) to impose physical constraints on the latent variables. In global long-term TC intensity forecasting, VQLTI achieves state-of-the-art results at the 24h to 120h forecast horizons, with the MSW (Maximum Sustained Wind) forecast error reduced by 35.65%-42.51% compared to ECMWF-IFS.
中文摘要:VQLTI框架通过融合空间信息与物理知识,将热带气旋强度预报误差降低35.65%-42.51%,实现了全球长期强度预报的技术突破。
English Summary: The VQLTI framework enhances long-term tropical cyclone intensity forecasting by integrating spatial information with physical knowledge, reducing maximum sustained wind forecast errors by 35.65%-42.51% compared to ECMWF-IFS.

Authors:Hao Mo, Yaping Sun, Shumin Yao, Hao Chen, Zhiyong Chen, Xiaodong Xu, Nan Ma, Meixia Tao, Shuguang Cui
Title: SCDM: Score-Based Channel Denoising Model for Digital Semantic Communications
Abstract:
Score-based diffusion models represent a significant variant within the diffusion model family and have seen extensive application in the increasingly popular domain of generative tasks. Recent investigations have explored the denoising potential of diffusion models in semantic communications. However, in previous paradigms, noise distortion in the diffusion process does not match precisely with digital channel noise characteristics. In this work, we introduce the Score-Based Channel Denoising Model (SCDM) for Digital Semantic Communications (DSC). SCDM views the distortion of constellation symbol sequences in digital transmission as a score-based forward diffusion process. We design a tailored forward noise corruption to align digital channel noise properties in the training phase. During the inference stage, the well-trained SCDM can effectively denoise received semantic symbols under various SNR conditions, reducing the difficulty for the semantic decoder in extracting semantic information from the received noisy symbols and thereby enhancing the robustness of the reconstructed semantic information. Experimental results show that SCDM outperforms the baseline model in PSNR, SSIM, and MSE metrics, particularly at low SNR levels. Moreover, SCDM reduces storage requirements by a factor of 7.8. This efficiency in storage, combined with its robust denoising capability, makes SCDM a practical solution for DSC across diverse channel conditions.
Chinese: 本文针对数字语义通信提出了基于分数的信道去噪模型(SCDM),通过将扩散过程与数字信道噪声特性对齐,有效去除语义符号噪声,提升重建鲁棒性,同时将存储需求降低7.8倍。
English: This paper introduces the Score-Based Channel Denoising Model (SCDM) for digital semantic communications, which aligns the diffusion process with digital channel noise to effectively denoise semantic symbols and enhance reconstruction robustness while reducing storage requirements by 7.8 times.

Authors:Zijie Liu, Xinyu Zhao, Jie Peng, Zhuangdi Zhu, Qingyu Chen, Xia Hu, Tianlong Chen
Title: Dialogue is Better Than Monologue: Instructing Medical LLMs via Strategical Conversations
Abstract:
Current medical AI systems often fail to replicate real-world clinical reasoning, as they are predominantly trained and evaluated on static text and question-answer tasks. These tuning methods and benchmarks overlook critical aspects like evidence-based reasoning and handling distracting information. To bridge this gap, we introduce a novel benchmark that simulates real-world diagnostic scenarios, integrating noise and difficulty levels aligned with USMLE standards. Moreover, we explore dialogue-based fine-tuning, which transforms static datasets into conversational formats to better capture iterative reasoning processes. Experiments show that dialogue-tuned models outperform traditional methods, with improvements of $9.64\%$ in multi-round reasoning scenarios and $6.18\%$ in accuracy in a noisy environment. Our findings highlight dialogue tuning as a promising approach for advancing clinically aligned and robust medical AI systems.
Chinese: 当前医学AI系统因静态训练方法在真实临床推理中表现不足,而新型对话式微调方法显著提升了性能,使多轮推理能力提高9.64%,嘈杂环境下的准确率提升6.18%。
English: Current medical AI systems often fall short in real-world clinical reasoning due to static training methods, but a new dialogue-based fine-tuning approach significantly enhances performance, improving multi-round reasoning by 9.64% and accuracy in noisy environments by 6.18%.

Authors:Yuxuan Chen, Jiawen Li, Huijuan Shi, Yang Xu, Tian Guan, Lianghui Zhu, Yonghong He, Anjia Han
Title: Dynamic Hypergraph Representation for Bone Metastasis Cancer Analysis
Abstract:
Bone metastasis analysis is a significant challenge in pathology and plays a critical role in determining patient quality of life and treatment strategies. The microenvironment and specific tissue structures are essential for pathologists to predict primary bone cancer origins and to perform primary bone cancer subtyping. By digitizing bone tissue sections into whole slide images (WSIs) and leveraging deep learning to model slide embeddings, this analysis can be enhanced. However, tumor metastasis involves complex multivariate interactions with diverse bone tissue structures, which traditional WSI analysis methods such as multiple instance learning (MIL) fail to capture. Moreover, graph neural networks (GNNs), limited to modeling pairwise relationships, struggle to represent high-order biological associations. To address these challenges, we propose a dynamic hypergraph neural network (DyHG) that overcomes the edge construction limitations of traditional graph representations by connecting multiple nodes via hyperedges. A low-rank strategy is used to reduce the complexity of parameters in learning hypergraph structures, while a Gumbel-Softmax-based sampling strategy optimizes the patch distribution across hyperedges. An MIL aggregator is then used to derive a graph-level embedding for comprehensive WSI analysis. To evaluate the effectiveness of DyHG, we construct two large-scale datasets for primary bone cancer origins and subtyping classification based on real-world bone metastasis scenarios. Extensive experiments demonstrate that DyHG significantly outperforms state-of-the-art (SOTA) baselines, showcasing its ability to model complex biological interactions and improve the accuracy of bone metastasis analysis.
中文: 本研究提出动态超图神经网络(DyHG),通过建模全切片图像中的复杂组织相互作用来增强骨转移分析,在癌症起源和亚型分类任务上显著优于现有方法。
English: This study introduces a dynamic hypergraph neural network (DyHG) to enhance bone metastasis analysis by modeling complex tissue interactions in whole slide images, significantly outperforming existing methods in classifying cancer origins and subtypes.

Authors:Wataru Masaka, Mitsuki Sakamoto, Kenshi Abe, Kaito Ariu, Tuomas Sandholm, Atsushi Iwasaki
Title: On the Power of Perturbation under Sampling in Solving Extensive-Form Games
Abstract:
We investigate how perturbation does and does not improve the Follow-the-Regularized-Leader (FTRL) algorithm in solving imperfect-information extensive-form games under sampling, where payoffs are estimated from sampled trajectories. While optimistic algorithms are effective under full feedback, they often become unstable in the presence of sampling noise. Payoff perturbation offers a promising alternative for stabilizing learning and achieving \textit{last-iterate convergence}. We present a unified framework for \textit{Perturbed FTRL} algorithms and study two variants: PFTRL-KL (standard KL divergence) and PFTRL-RKL (Reverse KL divergence), the latter featuring an estimator with both unbiasedness and conditional zero variance. While PFTRL-KL generally achieves equivalent or better performance across benchmark games, PFTRL-RKL consistently outperforms it in Leduc poker, whose structure is, in a sense, more asymmetric than that of the other games. Given the modest advantage of PFTRL-RKL, we design a second experiment to isolate the effect of conditional zero variance, showing that the variance-reduction property of RKL improves last-iterate performance.
中文: 本研究探讨了在采样支付的不完美信息扩展式博弈中,扰动如何稳定跟随正则化领导者算法,表明虽然PFTRL-KL通常表现良好,但PFTRL-RKL凭借其方差缩减特性在非对称博弈(如Leduc扑克)中表现更优。
English: This study explores how perturbation stabilizes Follow-the-Regularized-Leader algorithms in imperfect-information games with sampled payoffs, showing that while PFTRL-KL generally performs well, PFTRL-RKL excels in asymmetric games like Leduc poker due to its variance-reduction properties.

Authors:Kirill Paramonov, Mete Ozay, Eunju Yang, Jijoong Moon, Umberto Michieli
Title: Controllable Forgetting Mechanism for Few-Shot Class-Incremental Learning
Abstract:
Class-incremental learning in the context of limited personal labeled samples (few-shot) is critical for numerous real-world applications, such as smart home devices. A key challenge in these scenarios is balancing the trade-off between adapting to new, personalized classes and maintaining the performance of the model on the original, base classes. Fine-tuning the model on novel classes often leads to the phenomenon of catastrophic forgetting, where the accuracy of base classes declines unpredictably and significantly. In this paper, we propose a simple yet effective mechanism to address this challenge by controlling the trade-off between novel and base class accuracy. We specifically target the ultra-low-shot scenario, where only a single example is available per novel class. Our approach introduces a Novel Class Detection (NCD) rule, which adjusts the degree of forgetting a priori while simultaneously enhancing performance on novel classes. We demonstrate the versatility of our solution by applying it to state-of-the-art Few-Shot Class-Incremental Learning (FSCIL) methods, showing consistent improvements across different settings. To better quantify the trade-off between novel and base class performance, we introduce new metrics: NCR@2FOR and NCR@5FOR. Our approach achieves up to a 30% improvement in novel class accuracy on the CIFAR100 dataset (1-shot, 1 novel class) while maintaining a controlled base class forgetting rate of 2%.
中文: 本文提出了一种简单机制,通过控制新旧类别准确率的权衡来解决超低样本类别增量学习中的灾难性遗忘问题,在仅2%基础类别遗忘率下实现了新类别准确率高达30%的提升。
English: This paper introduces a simple mechanism to address catastrophic forgetting in ultra-low-shot class-incremental learning by controlling the trade-off between novel and base class accuracy, achieving up to a 30% improvement in novel class accuracy on CIFAR100 while holding the base class forgetting rate at 2%.

Authors:Zhiling Chen, Hanning Chen, Mohsen Imani, Farhad Imani
Title: Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection?
Abstract:
In industrial settings, the accurate detection of anomalies is essential for maintaining product quality and ensuring operational safety. Traditional industrial anomaly detection (IAD) models often struggle with flexibility and adaptability, especially in dynamic production environments where new defect types and operational changes frequently arise. Recent advancements in Multimodal Large Language Models (MLLMs) hold promise for overcoming these limitations by combining visual and textual information processing capabilities. MLLMs excel in general visual understanding due to their training on large, diverse datasets, but they lack domain-specific knowledge, such as industry-specific defect tolerance levels, which limits their effectiveness in IAD tasks. To address these challenges, we propose Echo, a novel multi-expert framework designed to enhance MLLM performance for IAD. Echo integrates four expert modules: Reference Extractor which provides a contextual baseline by retrieving similar normal images, Knowledge Guide which supplies domain-specific insights, Reasoning Expert which enables structured, stepwise reasoning for complex queries, and Decision Maker which synthesizes information from all modules to deliver precise, context-aware responses. Evaluated on the MMAD benchmark, Echo demonstrates significant improvements in adaptability, precision, and robustness, moving closer to meeting the demands of real-world industrial anomaly detection.
中文摘要:提出的Echo框架通过将多模态大语言模型与专业专家模块相结合,显著提升了动态环境中工业异常检测的适应性和精确度。
English Summary: The proposed Echo framework enhances industrial anomaly detection by integrating multimodal large language models with specialized expert modules, significantly improving adaptability and precision in dynamic environments.

Authors:Jiahang Tu, Qian Feng, Chufan Chen, Jiahua Dong, Hanbin Zhao, Chao Zhang, Hui Qian
Title: CE-SDWV: Effective and Efficient Concept Erasure for Text-to-Image Diffusion Models via a Semantic-Driven Word Vocabulary
Abstract:
Large-scale text-to-image (T2I) diffusion models have achieved remarkable generative performance across various concepts. Given privacy and safety constraints in practice, the generative capability concerning NSFW (Not Safe For Work) concepts is undesirable, e.g., producing sexually explicit photos and licensed images. The concept erasure task for T2I diffusion models has attracted considerable attention and requires an effective and efficient method. To achieve this goal, we propose the CE-SDWV framework, which removes the target concepts (e.g., NSFW concepts) of T2I diffusion models in the text semantic space by only adjusting the text condition tokens and does not need to re-train the original T2I diffusion model's weights. Specifically, our framework first builds a target concept-related word vocabulary to enhance the representation of the target concepts within the text semantic space, and then utilizes an adaptive semantic component suppression strategy to ablate the target concept-related semantic information in the text condition tokens. To further adapt the above text condition tokens to the original image semantic space, we propose an end-to-end gradient-orthogonal token optimization strategy. Extensive experiments on I2P and UnlearnCanvas benchmarks demonstrate the effectiveness and efficiency of our method.
中文:CE-SDWV框架通过调整文本条件令牌在语义空间中消除文本到图像扩散模型中的不良概念(如NSFW内容),无需重新训练模型,实验证明其高效有效。
English: The CE-SDWV framework effectively removes undesirable concepts like NSFW content from text-to-image diffusion models by modifying text tokens in the semantic space without retraining the model, as validated through extensive experiments.

Authors:Yanbiao Ji, Dan Luo, Chang Liu, Shaokai Wu, Jing Tong, Qicheng He, Deyi Ji, Hongtao Lu, Yue Ding
Title: Generating Negative Samples for Multi-Modal Recommendation
Abstract:
Multi-modal recommender systems (MMRS) have gained significant attention due to their ability to leverage information from various modalities to enhance recommendation quality. However, existing negative sampling techniques often struggle to effectively utilize the multi-modal data, leading to suboptimal performance. In this paper, we identify two key challenges in negative sampling for MMRS: (1) producing cohesive negative samples contrasting with positive samples and (2) maintaining a balanced influence across different modalities. To address these challenges, we propose NegGen, a novel framework that utilizes multi-modal large language models (MLLMs) to generate balanced and contrastive negative samples. We design three different prompt templates to enable NegGen to analyze and manipulate item attributes across multiple modalities, and then generate negative samples that introduce better supervision signals and ensure modality balance. Furthermore, NegGen employs a causal learning module to disentangle the effect of intervened key features and irrelevant item attributes, enabling fine-grained learning of user preferences. Extensive experiments on real-world datasets demonstrate the superior performance of NegGen compared to state-of-the-art methods in both negative sampling and multi-modal recommendation.
中文: 本文提出NegGen框架,利用多模态大语言模型生成平衡且对比性的负样本,通过因果学习模块解决多模态推荐系统中负采样面临的样本对比性不足和模态影响不均衡问题,显著提升了推荐性能。
English: This paper introduces NegGen, a novel framework that leverages multi-modal large language models to generate balanced and contrastive negative samples, addressing key challenges in multi-modal recommender systems by improving supervision signals and ensuring modality balance through causal learning.

Authors:Alexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi, Davide Taibi
Title: Network Centrality as a New Perspective on Microservice Architecture
Abstract:
Context: Over the past decade, the adoption of Microservice Architecture (MSA) has led to the identification of various patterns and anti-patterns, such as Nano/Mega/Hub services. Detecting these anti-patterns often involves modeling the system as a Service Dependency Graph (SDG) and applying graph-theoretic approaches. Aim: While previous research has explored software metrics (SMs) such as size, complexity, and quality for assessing MSAs, the potential of graph-specific metrics like network centrality remains largely unexplored. This study investigates whether centrality metrics (CMs) can provide new insights into MSA quality and facilitate the detection of architectural anti-patterns, complementing or extending traditional SMs. Method: We analyzed 24 open-source MSA projects, reconstructing their architectures to study 53 microservices. We measured SMs and CMs for each microservice and tested their correlation to determine the relationship between these metric types. Results and Conclusion: Among 902 computed metric correlations, we found weak to moderate correlation in 282 cases. These findings suggest that centrality metrics offer a novel perspective for understanding MSA properties. Specifically, ratio-based centrality metrics show promise for detecting specific anti-patterns, while subgraph centrality needs further investigation for its applicability in architectural assessments.
中文摘要:本研究探索将图中心性指标与传统软件指标结合评估微服务架构质量,发现弱至中等相关性表明中心性指标为检测架构反模式提供了新视角。
English Summary: This study explores the use of graph centrality metrics alongside traditional software metrics to assess microservice architecture quality, finding weak to moderate correlations that suggest centrality metrics provide novel insights for detecting architectural anti-patterns.
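
As a rough illustration of the graph view described above, the sketch below computes normalised in-degree and out-degree centrality on a toy Service Dependency Graph. The service names, the edge list, and the helper `degree_centralities` are hypothetical and are not taken from the study; degree centrality is only one of many possible centrality metrics, and the study's own metric set may differ.

```python
from collections import defaultdict

def degree_centralities(edges):
    """Return {service: (in_degree, out_degree)}, each normalised by (n - 1).

    `edges` is an iterable of unique (caller, callee) pairs.
    """
    nodes = set()
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for caller, callee in edges:
        nodes.update((caller, callee))
        outdeg[caller] += 1
        indeg[callee] += 1
    scale = max(len(nodes) - 1, 1)  # avoid division by zero for a single node
    return {n: (indeg[n] / scale, outdeg[n] / scale) for n in sorted(nodes)}

# Hypothetical SDG: an edge (a, b) means service a calls service b.
sdg_edges = [
    ("gateway", "orders"),
    ("gateway", "users"),
    ("orders", "users"),
    ("orders", "payments"),
    ("payments", "users"),
]
centrality = degree_centralities(sdg_edges)
for service, (c_in, c_out) in centrality.items():
    print(f"{service}: in={c_in:.2f} out={c_out:.2f}")
```

In this toy graph, `users` is called by every other service and calls none, which is the kind of signal one might inspect when looking for hub-like services.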

Authors:Ismail Cosandal, Sennur Ulukus, Nail Akar
Title: Which Sensor to Observe? Timely Tracking of a Joint Markov Source with Model Predictive Control
Abstract:
In this paper, we investigate the problem of remote estimation of a discrete-time joint Markov process using multiple sensors. Each sensor observes a different component of the joint Markov process, and in each time slot, the monitor obtains a partial state value by sending a pull request to one of the sensors. The monitor chooses the sequence of sensors to observe with the goal of minimizing the mean of age of incorrect information (MAoII) by using the partial state observations obtained, which have different freshness levels. For instance, a monitor may be interested in tracking the location of an object by obtaining observations from two sensors, which observe the $x$ and $y$ coordinates of the object separately, in different time slots. The monitor, then, needs to decide which coordinate to observe in the next time slot given the history. In addition to this partial observability of the state of Markov process, there is an erasure channel with a fixed one-slot delay between each sensor and the monitor. First, we obtain a sufficient statistic, namely the \emph{belief}, representing the joint distribution of the age of incorrect information (AoII) and the current state of the observed process by using the history of all pull requests and observations. Then, we formulate the problem with a continuous state-space Markov decision problem (MDP), namely belief MDP. To solve the problem, we propose two model predictive control (MPC) methods, namely MPC without terminal costs (MPC-WTC) and reinforcement learning MPC (RL-MPC), that have different advantages in implementation.
中文: 本研究通过多传感器远程估计联合马尔可夫过程,在部分可观测和通信延迟条件下,采用模型预测控制方法优化传感器选择策略以最小化信息不准确度。
English: This study explores remote estimation of a joint Markov process using multiple sensors, developing model predictive control methods that minimize the mean age of incorrect information (MAoII) by strategically selecting sensors under partial observability and communication delays.
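
To make the objective concrete, here is a minimal sketch of how an age-of-incorrect-information trajectory and its mean could be computed for a single sample path. It assumes the common linear-penalty form, in which the age grows by one in every slot where the monitor's estimate differs from the true state and resets to zero when they agree; the abstract does not spell out the exact definition, and the function names and sample sequences are hypothetical.

```python
def aoii_trajectory(true_states, estimates):
    """Per-slot AoII: 0 when the estimate is correct, otherwise previous age + 1."""
    age = 0
    ages = []
    for x, x_hat in zip(true_states, estimates):
        age = 0 if x == x_hat else age + 1
        ages.append(age)
    return ages

def mean_aoii(true_states, estimates):
    """Time-average of the AoII trajectory over the observed horizon."""
    ages = aoii_trajectory(true_states, estimates)
    return sum(ages) / len(ages)

# Hypothetical sample path: the estimate lags behind the source twice.
true_states = [0, 1, 1, 2, 2, 2]
estimates = [0, 0, 0, 2, 1, 1]
print(aoii_trajectory(true_states, estimates))
print(mean_aoii(true_states, estimates))
```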

Authors:Xiancai Chen, Zhengwei Tao, Kechi Zhang, Changzhi Zhou, Wanli Gu, Yuanpeng He, Mengdi Zhang, Xunliang Cai, Haiyan Zhao, Zhi Jin
Title: Revisit Self-Debugging with Self-Generated Tests for Code Generation
Abstract:
Large language models (LLMs) have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of code generation by leveraging execution feedback from tests. Despite its promise, the availability of high-quality tests in real-world scenarios is limited. In this context, self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential. Therefore, we investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests. On the other hand, in-execution self-debugging enables LLMs to mitigate the bias by solely leveraging intermediate states during execution, thereby enhancing code generation.
中文: 利用自生成测试的自调试在基础编程任务上效果有限,但在竞争性问题上潜力显著,其中执行中调试通过利用中间状态减少偏差,表现优于执行后调试。
English: Self-debugging using self-generated tests shows limited effectiveness on basic programming tasks but holds promise for competitive problems, with in-execution debugging outperforming post-execution by leveraging intermediate states to reduce bias.
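
The post-execution paradigm can be pictured as a loop that runs a candidate program against tests and hands any failures back for repair. The sketch below is an assumption-laden toy: `run_tests`, `post_execution_self_debug`, and the `repair` callback (standing in for an LLM call) are hypothetical names, and the tests are supplied by hand rather than generated by a model.

```python
def run_tests(func, tests):
    """Run func on each (args, expected) pair and collect the failures."""
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:  # a crash also counts as a failure
            failures.append((args, expected, repr(exc)))
            continue
        if got != expected:
            failures.append((args, expected, got))
    return failures

def post_execution_self_debug(candidate, tests, repair, max_rounds):
    """Repair the candidate until the tests pass or the round budget runs out."""
    for _ in range(max_rounds):
        failures = run_tests(candidate, tests)
        if not failures:
            break
        candidate = repair(candidate, failures)
    return candidate

# Toy stand-ins: a buggy absolute-value function and a "repair" that fixes it.
def buggy_abs(x):
    return x

def toy_repair(candidate, failures):
    return lambda x: -x if x < 0 else x

tests = [((-2,), 2), ((3,), 3)]
fixed = post_execution_self_debug(buggy_abs, tests, toy_repair, max_rounds=3)
print(fixed(-5))
```

Because the loop trusts the tests, an incorrect self-generated test would steer the repair in the wrong direction, which is the bias the abstract refers to.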

Authors:Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu
Title: NExtLong: Toward Effective Long-Context Training without Long Documents
Abstract:
Large language models (LLMs) with extended context windows have made significant strides, yet training them remains a challenge due to the scarcity of long documents. Existing methods tend to synthesize long-context data but lack a clear mechanism to reinforce the long-range dependency modeling. To address this limitation, we propose NExtLong, a novel framework for synthesizing long-context data through Negative document Extension. NExtLong decomposes a document into multiple meta-chunks and extends the context by interleaving hard negative distractors retrieved from pretraining corpora. This approach compels the model to discriminate long-range dependent context from distracting content, enhancing its ability to model long-range dependencies. Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents. These findings highlight NExtLong's ability to reduce reliance on non-synthetic long documents, making it an effective framework for developing advanced long-context LLMs.
中文摘要:NExtLong是一种通过分解文档为元块并插入困难负样本干扰项来合成长文本数据的新框架,迫使模型增强长距离依赖建模能力,并在基准测试中取得显著性能提升。
English Summary: NExtLong is a novel framework that synthesizes long-context data by decomposing documents into meta-chunks and interleaving hard negative distractors, forcing models to enhance long-range dependency modeling and achieving superior performance on benchmarks.
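
The chunk-and-interleave step can be sketched as follows. This is only an illustration under stated assumptions: chunking by a fixed word count, placing distractors directly after each meta-chunk, and passing the retrieved negatives in as a ready-made list are all choices made here for brevity, and the function names are hypothetical. The retrieval of hard negatives from a pretraining corpus is not shown.

```python
def split_into_meta_chunks(text, chunk_words):
    """Split a document into consecutive chunks of at most chunk_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]

def interleave_with_negatives(meta_chunks, negatives_per_chunk):
    """Place each chunk's distractors right after it, preserving chunk order."""
    pieces = []
    for chunk, negatives in zip(meta_chunks, negatives_per_chunk):
        pieces.append(chunk)
        pieces.extend(negatives)
    return pieces

# Toy document and hand-written stand-ins for retrieved hard negatives.
document = "a b c d e f"
chunks = split_into_meta_chunks(document, chunk_words=2)
negatives = [["n1"], ["n2", "n3"], []]
long_sample = interleave_with_negatives(chunks, negatives)
print(long_sample)
```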

Authors:Matteo Esposito, Mikel Robredo, Murali Sridharan, Guilherme Horta Travassos, Rafael Peñaloza, Valentina Lenarduzzi
Title: A Call for Critically Rethinking and Reforming Data Analysis in Empirical Software Engineering
Abstract:
Context: Empirical Software Engineering (ESE) drives innovation in SE through qualitative and quantitative studies. However, concerns about the correct application of empirical methodologies have existed since the 2006 Dagstuhl seminar on SE. Objective: To analyze three decades of SE research, identify mistakes in statistical methods, and evaluate experts' ability to detect and address these issues. Methods: We conducted a literature survey of ~27,000 empirical studies, using LLMs to classify statistical methodologies as adequate or inadequate. Additionally, we selected 30 primary studies and held a workshop with 33 ESE experts to assess their ability to identify and resolve statistical issues. Results: Significant statistical issues were found in the primary studies, and experts showed limited ability to detect and correct these methodological problems, raising concerns about the broader ESE community's proficiency in this area. Conclusions: Despite our study's potential limitations, its results shed light on recurring issues from promoting information copy-and-paste from past authors' works and the continuous publication of inadequate approaches that promote dubious results and jeopardize the spread of the correct statistical strategies among researchers. Besides, it justifies further investigation into empirical rigor in software engineering to expose these recurring issues and establish a framework for reassessing our field's foundation of statistical methodology application. Therefore, this work calls for critically rethinking and reforming data analysis in empirical software engineering, paving the way for our upcoming work.
中文摘要:本研究分析三十年软件工程研究,发现实证研究中存在显著统计错误,且专家识别纠正能力有限,亟需推动实证软件工程领域方法论改革。
English Summary: This study analyzed three decades of software engineering research and found significant statistical errors in empirical studies, with experts demonstrating limited ability to detect or correct these issues, highlighting the need for methodological reform in the field.

Authors:Yafu Li, Zhilin Wang, Tingchen Fu, Ganqu Cui, Sen Yang, Yu Cheng
Title: From Drafts to Answers: Unlocking LLM Potential via Aggregation Fine-Tuning
Abstract:
Scaling data and model size has been proven effective for boosting the performance of large language models. In addition to training-time scaling, recent studies have revealed that increasing test-time computational resources can further improve performance. In this work, we introduce Aggregation Fine-Tuning (AFT), a supervised finetuning paradigm where the model learns to synthesize multiple draft responses, referred to as proposals, into a single, refined answer, termed aggregation. At inference time, a propose-and-aggregate strategy further boosts performance by iteratively generating proposals and aggregating them. Empirical evaluations on benchmark datasets show that AFT-trained models substantially outperform standard SFT. Notably, an AFT model, fine-tuned from Llama3.1-8B-Base with only 64k data, achieves a 41.3% LC win rate on AlpacaEval 2, surpassing significantly larger LLMs such as Llama3.1-405B-Instruct and GPT4. By combining sequential refinement and parallel sampling, the propose-and-aggregate framework scales inference-time computation in a flexible manner. Overall, these findings position AFT as a promising approach to unlocking additional capabilities of LLMs without resorting to increasing data volume or model size.
Chinese: 聚合微调(AFT)是一种新颖的监督微调范式,通过训练模型将多个草稿回答合成为精炼答案,显著提升了性能,即使使用较小模型和少量数据也能超越标准方法。
English: Aggregation Fine-Tuning (AFT) is a novel supervised fine-tuning method that enhances model performance by teaching it to synthesize multiple draft responses into a refined answer, significantly outperforming standard approaches even with smaller models and less data.
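
One way to picture the propose-and-aggregate strategy is as a loop with a width (parallel proposals per round) and a depth (sequential refinement rounds). The sketch below is a guess at that control flow, not the authors' implementation: `propose_and_aggregate`, `width`, `depth`, and the toy `propose` and `aggregate` callables (which stand in for model calls) are all hypothetical.

```python
import itertools
from collections import Counter

def propose_and_aggregate(query, propose, aggregate, width, depth):
    """Sample `width` drafts, then refine them for `depth` aggregation rounds."""
    drafts = [propose(query) for _ in range(width)]
    for _ in range(depth - 1):
        # Each intermediate round turns the current drafts into new proposals.
        drafts = [aggregate(query, drafts) for _ in range(width)]
    return aggregate(query, drafts)  # final aggregation yields one answer

# Toy stand-ins: a noisy proposer and a majority-vote aggregator.
_toy_drafts = itertools.cycle(["4", "5", "4"])

def toy_propose(query):
    return next(_toy_drafts)

def toy_aggregate(query, drafts):
    return Counter(drafts).most_common(1)[0][0]

answer = propose_and_aggregate(
    "2 + 2 = ?", toy_propose, toy_aggregate, width=3, depth=2
)
print(answer)
```

Raising `width` adds parallel sampling and raising `depth` adds sequential refinement, which mirrors the two scaling axes the abstract mentions.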

Authors:Xiaodong Li, Hengzhu Tang, Jiawei Sheng, Xinghua Zhang, Li Gao, Suqi Cheng, Dawei Yin, Tingwen Liu
Title: Exploring Preference-Guided Diffusion Model for Cross-Domain Recommendation
Abstract:
Cross-domain recommendation (CDR) has been proven as a promising way to alleviate the cold-start issue, in which the most critical problem is how to draw an informative user representation in the target domain via the transfer of user preference existing in the source domain. Prior efforts mostly follow the embedding-and-mapping paradigm, which first integrates the preference into the user representation in the source domain, and then applies a mapping function to carry this representation to the target domain. However, they focus on mapping features across domains, neglecting to explicitly model the preference integration process, which may lead to learning coarse user representations. Diffusion models (DMs), which contribute to more accurate user/item representations due to their explicit information injection capability, have achieved promising performance in recommendation systems. Nevertheless, these DM-based methods cannot directly account for valuable user preference in other domains, leading to challenges in adapting to the transfer of preference for cold-start users. Consequently, the feasibility of DMs for CDR remains underexplored. To this end, we explore utilizing the explicit information injection capability of DMs for user preference integration and propose a Preference-Guided Diffusion Model for CDR for cold-start users, termed DMCDR. Specifically, we leverage a preference encoder to establish the preference guidance signal with the user's interaction history in the source domain. Then, we explicitly inject the preference guidance signal into the user representation step by step to guide the reverse process, and ultimately generate the personalized user representation in the target domain, thus achieving the transfer of user preference across domains. Furthermore, we comprehensively explore the impact of six DM-based variants on CDR.
中文: 跨领域推荐通过扩散模型整合源领域的用户偏好,逐步注入偏好信号来生成目标领域的个性化用户表示,从而有效解决冷启动问题。
English: Cross-domain recommendation can address the cold-start problem by using diffusion models to integrate user preferences from a source domain, guiding the generation of personalized representations in the target domain through explicit information injection.

Authors:Chengze Ye, Linda-Sophie Schneider, Yipeng Sun, Mareike Thies, Andreas Maier
Title: Compressibility Analysis for the differentiable shift-variant Filtered Backprojection Model
Abstract:
The differentiable shift-variant filtered backprojection (FBP) model enables the reconstruction of cone-beam computed tomography (CBCT) data for arbitrary non-circular trajectories. This method employs a deep learning technique to estimate the redundancy weights required for reconstruction, given knowledge of the specific trajectory at optimization time. However, computing the redundancy weight for each projection remains computationally intensive. This paper presents a novel approach to compress and optimize the differentiable shift-variant FBP model based on Principal Component Analysis (PCA). We apply PCA to the redundancy weights learned from sinusoidal trajectory projection data, revealing significant parameter redundancy in the original model. By integrating PCA directly into the differentiable shift-variant FBP reconstruction pipeline, we develop a method that decomposes the redundancy weight layer parameters into a trainable eigenvector matrix, compressed weights, and a mean vector. This innovative technique achieves a remarkable 97.25% reduction in trainable parameters without compromising reconstruction accuracy. As a result, our algorithm significantly decreases the complexity of the differentiable shift-variant FBP model and greatly improves training speed. These improvements make the model substantially more practical for real-world applications.
中文: 本文提出了一种基于主成分分析的可微分移变滤波反投影模型压缩方法,在保持重建精度的同时将可训练参数减少97.25%,显著提升了训练效率与实际应用价值。
English: This paper introduces a PCA-based compression method for the differentiable shift-variant FBP model that reduces trainable parameters by 97.25% while maintaining reconstruction accuracy, significantly enhancing training efficiency and practical applicability.

Authors:Zag ElSayed, Ahmed Abdelgawad, Nelly Elsayed
Title: Cybersecurity and Frequent Cyber Attacks on IoT Devices in Healthcare: Issues and Solutions
Abstract:
Integrating Internet of Things (IoT) devices in healthcare has revolutionized patient care, offering improved monitoring, diagnostics, and treatment. However, the proliferation of these devices has also introduced significant cybersecurity challenges. This paper reviews the current landscape of cybersecurity threats targeting IoT devices in healthcare, discusses the underlying issues contributing to these vulnerabilities, and explores potential solutions. Additionally, this study offers solutions and suggestions for researchers, agencies, and security specialists to overcome these IoT in healthcare cybersecurity vulnerabilities. A comprehensive literature survey highlights the nature and frequency of cyber attacks, their impact on healthcare systems, and emerging strategies to mitigate these risks.
中文: 医疗物联网设备在提升患者护理的同时带来了网络安全风险,本文通过综述威胁、漏洞及解决方案,为相关方提供应对这些挑战的策略建议。
English: IoT devices in healthcare enhance patient care but pose cybersecurity risks, which this paper examines by reviewing threats, vulnerabilities, and solutions while suggesting strategies for stakeholders to address these challenges.

Authors:Jingran Xie, Shun Lei, Yue Yu, Yang Xiang, Hui Wang, Xixin Wu, Zhiyong Wu
Title: Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data
Abstract:
Empathetic dialogue is crucial for natural human-computer interaction, allowing the dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation, and their powerful capabilities have shown potential in multimodal domains. Many studies have integrated speech with text-based LLMs to take spoken questions as input and output text responses. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process, initially guiding the LLM to listen to the content and perceive the emotional aspects of speech. Subsequently, we utilize Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on the spoken content it has listened to and the emotional cues it has perceived. We conduct experiments to demonstrate the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.
Chinese Summary: LPE方法通过两阶段训练使大语言模型从语音中理解内容和情感,并利用思维链提示生成共情回应,无需问答数据集即可提升语音对话系统的情感交互能力。
English Summary: The LPE method enhances empathetic dialogue in speech-based systems by training large language models to understand both content and emotion from speech, using Chain-of-Thought prompting to generate emotionally aware responses without requiring question-answering datasets.

Authors:Qianru Zhang, Xinyi Gao, Haixin Wang, Siu-Ming Yiu, Hongzhi Yin
Title: Efficient Traffic Prediction Through Spatio-Temporal Distillation
Abstract:
Graph neural networks (GNNs) have gained considerable attention in recent years for traffic flow prediction due to their ability to learn spatio-temporal pattern representations through a graph-based message-passing framework. Although GNNs have shown great promise in handling traffic datasets, their deployment in real-life applications has been hindered by scalability constraints arising from high-order message passing. Additionally, the over-smoothing problem of GNNs may lead to indistinguishable region representations as the number of layers increases, resulting in performance degradation. To address these challenges, we propose a new knowledge distillation paradigm termed LightST that transfers spatial and temporal knowledge from a high-capacity teacher to a lightweight student. Specifically, we introduce a spatio-temporal knowledge distillation framework that helps student MLPs capture graph-structured global spatio-temporal patterns while alleviating the over-smoothing effect with adaptive knowledge distillation. Extensive experiments verify that LightST significantly speeds up traffic flow predictions by 5X to 40X compared to state-of-the-art spatio-temporal GNNs, all while maintaining superior accuracy.
中文: 图神经网络在交通流预测中存在可扩展性和过度平滑问题,因此提出LightST知识蒸馏方法,将时空知识从教师模型传递给学生模型,在保持精度的同时显著提升了预测速度。
English: Graph neural networks face scalability and over-smoothing issues in traffic flow prediction, prompting the development of LightST, a knowledge distillation method that transfers spatio-temporal knowledge from a teacher to a student model, achieving faster predictions with maintained accuracy.
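
Teacher-to-student transfer of the kind described above is often implemented as a weighted sum of a ground-truth loss and a teacher-matching loss. The sketch below shows that generic form for regression outputs. The weighting scheme and the names `distillation_loss` and `alpha` are assumptions for illustration; LightST's adaptive distillation is not reproduced here.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Blend the supervised loss with a term pulling the student toward the teacher.

    alpha = 1.0 ignores the teacher; alpha = 0.0 ignores the ground truth.
    """
    supervised = mse(student_pred, target)
    imitation = mse(student_pred, teacher_pred)
    return alpha * supervised + (1.0 - alpha) * imitation


# Hypothetical traffic-flow values for two regions.
loss = distillation_loss([2.0, 2.0], [1.0, 2.0], [2.0, 4.0], alpha=0.5)
```

With a fixed `alpha` the student weighs both signals equally everywhere; an adaptive scheme would instead vary the weight, for example per region or per training step.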

Authors:Cunhang Fan, Sheng Zhang, Jingjing Zhang, Zexu Pan, Zhao Lv
Title: SSM2Mel: State Space Model to Reconstruct Mel Spectrogram from the EEG
Abstract:
Decoding speech from brain signals is a challenging research problem that holds significant importance for studying speech processing in the brain. Although breakthroughs have been made in reconstructing the mel spectrograms of audio stimuli perceived by subjects at the word or letter level using noninvasive electroencephalography (EEG), there is still a critical gap in precisely reconstructing continuous speech features, especially at the minute level. To address this issue, this paper proposes a State Space Model (SSM) to reconstruct the mel spectrogram of continuous speech from EEG, named SSM2Mel. This model introduces a novel Mamba module to effectively model the long sequence of EEG signals for imagined speech. In the SSM2Mel model, the S4-UNet structure is used to enhance the extraction of local features of EEG signals, and the Embedding Strength Modulator (ESM) module is used to incorporate subject-specific information. Experimental results show that our model achieves a Pearson correlation of 0.069 on the SparrKULee dataset, which is a 38% improvement over the previous baseline.
中文摘要:本文提出SSM2Mel模型,通过创新Mamba模块从脑电信号重建连续语音的梅尔频谱图,在SparrKULee数据集上相比基线方法性能提升38%。
English Summary: This paper introduces SSM2Mel, a State Space Model that reconstructs continuous speech mel spectrograms from EEG signals using a novel Mamba module and achieves a 38% performance improvement over previous methods.
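
The Pearson correlation reported above is the standard sample correlation coefficient. A minimal sketch of how it could be computed between one reconstructed and one reference mel band follows; the function name and the toy sequences are hypothetical, and the paper's exact evaluation protocol (windowing, averaging over bands and subjects) is not shown.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy example: a perfectly linear relation and a perfectly inverted one.
r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_flip = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```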

Authors:Chao Feng, Nicolas Fazli Kohler, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller
Title: ColNet: Collaborative Optimization in Decentralized Federated Multi-task Learning Systems
Abstract:
The integration of Federated Learning (FL) and Multi-Task Learning (MTL) has been explored to address client heterogeneity, with Federated Multi-Task Learning (FMTL) treating each client as a distinct task. However, most existing research focuses on data heterogeneity (e.g., addressing non-IID data) rather than task heterogeneity, where clients solve fundamentally different tasks. Additionally, much of the work relies on centralized settings with a server managing the federation, leaving the more challenging domain of decentralized FMTL largely unexplored. This work bridges this gap by proposing ColNet, a framework designed for heterogeneous tasks in decentralized federated environments. ColNet divides models into the backbone and task-specific layers, forming groups of similar clients, with group leaders performing conflict-averse cross-group aggregation. Experiments with different federations demonstrate that ColNet outperforms the compared aggregation schemes in decentralized settings under label and task heterogeneity.
中文摘要:本文提出ColNet框架,通过将模型分解为共享主干和任务特定头部,采用自适应聚类和冲突规避聚合策略,在去中心化联邦多任务学习中有效处理任务异构性,在多种数据集上优于现有方案。
English Summary: This paper introduces ColNet, a decentralized federated multi-task learning framework that addresses task heterogeneity by partitioning models into shared backbones and task-specific heads, using adaptive clustering and conflict-averse aggregation to outperform existing methods across diverse datasets.

Authors:Chao Feng, Nicolas Fazli Kohler, Zhi Wang, Weijie Niu, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller
Title: ColNet: Collaborative Optimization in Decentralized Federated Multi-task Learning Systems
Abstract:
The integration of Federated Learning (FL) and Multi-Task Learning (MTL) has been explored to address client heterogeneity, with Federated Multi-Task Learning (FMTL) treating each client as a distinct task. However, most existing research focuses on data heterogeneity (e.g., addressing non-IID data) rather than task heterogeneity, where clients solve fundamentally different tasks. Additionally, much of the work relies on centralized settings with a server managing the federation, leaving the more challenging domain of decentralized FMTL largely unexplored. Thus, this work bridges this gap by proposing ColNet, a framework designed for heterogeneous tasks in decentralized federated environments. ColNet partitions models into a backbone and task-specific heads, and uses adaptive clustering based on model and data sensitivity to form task-coherent client groups. Backbones are averaged within groups, and group leaders perform hyper-conflict-averse cross-group aggregation. Across datasets and federations, ColNet outperforms competing schemes under label and task heterogeneity and shows robustness to poisoning attacks.
中文摘要:本文提出ColNet框架,通过将模型分解为共享主干和任务特定头部,采用自适应聚类和冲突规避聚合策略,在去中心化联邦多任务学习中有效处理任务异构性,在多种数据集上优于现有方案。
English Summary: This paper introduces ColNet, a decentralized federated multi-task learning framework that addresses task heterogeneity by partitioning models into shared backbones and task-specific heads, using adaptive clustering and conflict-averse aggregation to outperform existing methods across diverse datasets.
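
The within-group step described above (backbones averaged inside each client group, task-specific heads kept local) can be sketched as follows. The data layout and the function name `average_backbones` are hypothetical, parameters are flattened to plain lists for brevity, and the clustering and cross-group conflict-averse aggregation are not shown.

```python
def average_backbones(clients, groups):
    """Replace each client's backbone with its group's element-wise mean.

    clients: dict mapping client id -> {"backbone": list[float], "head": list[float]}
    groups:  list of lists of client ids (assumed to come from a prior clustering step)
    Heads are left untouched because they are task-specific.
    """
    for group in groups:
        size = len(group)
        dim = len(clients[group[0]]["backbone"])
        mean = [sum(clients[c]["backbone"][i] for c in group) / size
                for i in range(dim)]
        for c in group:
            clients[c]["backbone"] = list(mean)
    return clients


# Hypothetical federation: clients a and b share a task group, c is alone.
clients = {
    "a": {"backbone": [1.0, 3.0], "head": [0.1]},
    "b": {"backbone": [3.0, 5.0], "head": [0.2]},
    "c": {"backbone": [10.0, 10.0], "head": [0.3]},
}
average_backbones(clients, [["a", "b"], ["c"]])
```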

Authors:Hongzhou Yu, Tianhao Cheng, Yingwen Wang, Wen He, Qing Wang, Ying Cheng, Yuejie Zhang, Rui Feng, Xiaobo Zhang
Title: FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training
Abstract:
Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the deep reasoning required for complex medical problems, such as differential diagnosis and medication recommendations. We propose FineMedLM-o1, which leverages high-quality medical synthetic data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also propose a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.
Chinese: FineMedLM-o1新型医疗大模型通过高质量合成数据和创新训练方法显著提升医疗推理能力,在核心医疗基准测试中性能平均提升23%,并首次引入的测试时训练技术进一步带来14%的性能增益。
English: FineMedLM-o1, a new medical large language model, significantly enhances reasoning capabilities through advanced training techniques and achieves a 23% performance improvement on medical benchmarks, with Test-Time Training providing an additional 14% boost.

Authors:Yu Shi, Abdul Ali Bangash, Emad Fallahzadeh, Bram Adams, Ahmed E. Hassan
Title: HAFix: History-Augmented Large Language Models for Bug Fixing
Abstract:
Recent studies have explored the performance of Large Language Models (LLMs) on various Software Engineering (SE) tasks, such as code generation and bug fixing. However, these approaches typically rely on the context data from the current snapshot of the project, overlooking the potential of rich historical data residing in real-world software repositories. Additionally, the impact of prompt styles on LLM performance for SE tasks within a historical context remains underexplored. To address these gaps, we propose HAFix, which stands for History-Augmented LLMs on Bug Fixing, a novel approach that leverages seven individual historical heuristics associated with bugs and aggregates the results of these heuristics (HAFix-Agg) to enhance LLMs' bug-fixing capabilities. To empirically evaluate HAFix, we employ three Code LLMs (i.e., Code Llama, DeepSeek-Coder and DeepSeek-Coder-V2-Lite models) on 51 single-line Python bugs from BugsInPy and 116 single-line Java bugs from Defects4J. Our evaluation demonstrates that multiple HAFix heuristics achieve statistically significant improvements compared to a non-historical baseline inspired by GitHub Copilot. Furthermore, the aggregated HAFix variant HAFix-Agg achieves substantial improvements by combining the complementary strengths of individual heuristics, increasing bug-fixing rates by an average of 45.05% on BugsInPy and 49.92% on Defects4J relative to the corresponding baseline. Moreover, within the context of historical heuristics, we identify the Instruction prompt style as the most effective template compared to the InstructionLabel and InstructionMask for LLMs in bug fixing. Finally, we evaluate the cost of HAFix in terms of inference time and token usage, and provide a pragmatic trade-off analysis of the cost and bug-fixing performance, offering valuable insights for the practical deployment of our approach in real-world scenarios.
Chinese: 最新研究提出HAFix方法,通过利用历史缺陷数据和聚合启发式策略显著提升大语言模型的缺陷修复能力,实现最高49.92%的性能提升,同时确定了最优提示模板并为实际应用提供了成本效益分析。
English: Recent research introduces HAFix, a novel approach that leverages historical bug data and aggregated heuristics to significantly enhance large language models' bug-fixing performance, increasing bug-fixing rates by an average of 45.05% on BugsInPy and 49.92% on Defects4J relative to the baseline, while identifying the most effective prompt style and analyzing the cost of practical deployment.

Authors:Sizhen Bian, Vitor Fortes Rey, Siyu Yuan, Paul Lukowicz
Title: Collaborative Human Activity Recognition with Passive Inter-Body Electrostatic Field
Abstract:
The passive body-area electrostatic field has recently been explored for wearable motion sensing because of two useful characteristics: sensitivity to full-body motion and sensitivity to the environment. Together, these potentially enable human activity recognition, both independently and jointly, from a single sensing front-end, and in theory make the modality a strong competitor to traditional inertial sensors, which cannot sense environmental variations. While most works focus on the electrostatic field of a single body as the target, this work, for the first time, quantitatively evaluates the mutual effect of inter-body electrostatic fields and its contribution to collaborative activity recognition. A wearable electrostatic field sensing front-end and wrist-worn prototypes are built, and a sixteen-hour, manually annotated dataset is collected, involving an experiment of manipulating objects both independently and collaboratively. A regression model is finally used to recognize the collaborative activities among users. Despite the theoretical advantages of the body electrostatic field, the recognition of both single and collaborative activities is unexpectedly less competitive than with the accelerometer. However, it is worth mentioning that this novel sensing modality improves the recognition F-score of user collaboration by 16% in the fusion result of the two wearable motion sensing modalities, demonstrating the potential of the body electrostatic field as a complementary, power-efficient signal for collaborative activity tracking using wearables.
中文: 本研究评估了体域静电场的协作活动识别潜力,发现尽管其单独性能不及传统加速度计,但与加速度计融合后可将协作识别F值提升16%,展现了其作为补充性低功耗传感方式的优势。
English: This study explores the potential of body-area electrostatic fields for collaborative activity recognition, revealing that while it underperforms traditional accelerometers alone, it enhances collaborative recognition by 16% when fused with accelerometer data, highlighting its value as a complementary sensing modality.

Authors:Bendegúz M. Györök, Jan H. Hoekstra, Johan Kon, Tamás Péni, Maarten Schoukens, Roland Tóth
Title: Orthogonal projection-based regularization for efficient model augmentation
Abstract:
Deep-learning-based nonlinear system identification has shown the ability to produce reliable and highly accurate models in practice. However, these black-box models lack physical interpretability, and a considerable part of the learning effort is often spent on capturing already expected or known behavior of the system, which can be accurately described by first-principles laws of physics. A potential solution is to directly integrate such prior physical knowledge into the model structure, combining the strengths of physics-based modeling and deep-learning-based identification. The most common approach is to use an additive model augmentation structure, where the physics-based and the machine-learning (ML) components are connected in parallel, i.e., additively. However, such models are overparametrized and challenging to train, which can cause the physics-based part to lose interpretability. To overcome this challenge, this paper proposes an orthogonal projection-based regularization technique to enhance parameter learning and even model accuracy in learning-based augmentation of nonlinear baseline models.
Chinese: 本文提出了一种基于正交投影的正则化技术,用于改进非线性系统深度学习辨识中的参数学习和模型精度,解决了物理与机器学习并行叠加模型中的过参数化及可解释性丧失问题。
English: This paper introduces an orthogonal projection-based regularization technique to improve parameter learning and model accuracy in deep-learning-based nonlinear system identification, addressing the overparametrization and interpretability loss in additive physics-ML models.

Authors:Shubham Aggarwal, Melih Bastopcu, Muhammad Aneeq uz Zaman, Tamer Başar, Sennur Ulukus, Nail Akar
Title: Fully Decentralized Computation Offloading in Priority-Driven Edge Computing Systems
Abstract:
We develop a novel framework for fully decentralized offloading policy design in multi-access edge computing (MEC) systems. The system comprises $N$ power-constrained user equipments (UEs) assisted by an edge server (ES) to process incoming tasks. Tasks are labeled with urgency flags, and in this paper, we classify them under three urgency levels, namely, high, moderate, and low urgency. We formulate the problem of designing computation decisions for the UEs within a large population noncooperative game framework, where each UE selfishly decides on how to split task execution between its local onboard processor and the ES. We employ the weighted average age of information (AoI) metric to quantify information freshness at the UEs. Increased onboard processing consumes more local power, while increased offloading may potentially incur a higher average AoI due to other UEs' packets being offloaded to the same ES. Thus, we use the mean-field game (MFG) formulation to compute approximate decentralized Nash equilibrium offloading and local computation policies for the UEs to balance between the information freshness and local power consumption. Finally, we provide a projected gradient descent-based algorithm to numerically assess the merits of our approach.
中文摘要:我们提出了一种多接入边缘计算系统的去中心化卸载框架,利用平均场博弈理论优化用户设备与边缘服务器间的任务处理,通过基于梯度的算法实现信息新鲜度与本地功耗的平衡。
English Summary: We propose a decentralized offloading framework for MEC systems using mean-field game theory to optimize task processing between user devices and edge servers, balancing information freshness and power consumption through a gradient-based algorithm.

Authors:Zixuan Feng, Igor Steinmacher, Marco Gerosa, Tyler Menezes, Alexander Serebrenik, Reed Milewicz, Anita Sarma
Title: The Multifaceted Nature of Mentoring in OSS: Strategies, Qualities, and Ideal Outcomes
Abstract:
Mentorship in open source software (OSS) is a vital, multifaceted process that includes onboarding newcomers, fostering skill development, and enhancing community building. This study examines task-focused mentoring strategies that help mentees complete their tasks and the ideal personal qualities and outcomes of good mentorship in OSS communities. We conducted two surveys to gather contributor perceptions: the first survey, with 70 mentors, mapped 17 mentoring challenges to 21 strategies that help support mentees. The second survey, with 85 contributors, assessed the importance of personal qualities and ideal mentorship outcomes. Our findings not only provide actionable strategies to help mentees overcome challenges and become successful contributors but also guide current and future mentors and OSS communities in understanding the personal qualities that are the cornerstone of good mentorship and the outcomes that mentor-mentee pairs should aspire to achieve.
中文: 本研究通过两项调查探讨了开源软件社区中以任务为导向的指导策略及关键个人品质,为帮助受指导者克服挑战和引导导师实现有效成果提供了可行方案。
English: This study explores task-focused mentoring strategies and essential personal qualities in open source software communities, identifying actionable approaches through dual surveys to help mentees overcome challenges and guide mentors toward effective outcomes.

Authors:Nora Gourmelon, Konrad Heidler, Erik Loebel, Daniel Cheng, Julian Klink, Anda Dong, Fei Wu, Noah Maul, Moritz Koch, Marcel Dreier, Dakota Pyles, Thorsten Seehaus, Matthias Braun, Andreas Maier, Vincent Christlein
Title: Comparison Study: Glacier Calving Front Delineation in Synthetic Aperture Radar Images With Deep Learning
Abstract:
Calving front position variation of marine-terminating glaciers is an indicator of ice mass loss and a crucial parameter in numerical glacier models. Deep Learning (DL) systems can automatically extract this position from Synthetic Aperture Radar (SAR) imagery, enabling continuous, weather- and illumination-independent, large-scale monitoring. This study presents the first comparison of DL systems on a common calving front benchmark dataset. A multi-annotator study with ten annotators is performed to contrast the best-performing DL system against human performance. The best DL model's outputs deviate 221 m on average, while the average deviation of the human annotators is 38 m. This significant difference shows that current DL systems do not yet match human performance and that further research is needed to enable fully automated monitoring of glacier calving fronts. The study of Vision Transformers, foundation models, and the inclusion and processing strategy of more information are identified as avenues for future research.
Chinese: 深度学习系统能够从卫星图像中自动追踪冰川崩解前缘,但目前精度尚不及人工标注——模型平均偏差达221米,而人工仅为38米,表明实现全自动监测仍需进一步研究。
English: Deep Learning systems can automatically track glacier calving fronts from satellite imagery but currently fall short of human accuracy, with models averaging 221-meter deviations compared to humans' 38 meters, highlighting the need for further research to achieve fully automated monitoring.

Authors:Luca Dede', Nicola Parolini, Alfio Quarteroni, Giulia Villani, Giovanni Ziarelli
Title: SEIHRDV: a multi-age multi-group epidemiological model and its validation on the COVID-19 epidemics in Italy
Abstract:
We propose a novel epidemiological model, referred to as SEIHRDV, for the numerical simulation of the COVID-19 epidemic, which we validate using data from Italy starting in September 2020. SEIHRDV features the following compartments: Susceptible (S), Exposed (E), Infectious (I), Healing (H), Recovered (R), Deceased (D) and Vaccinated (V). The model is age-stratified, as it considers the population split into 15 age groups. Moreover, it takes into account 7 different contexts of exposure to the infection (family, home, school, work, transport, leisure, other contexts), which affect the transmission mechanism. Thanks to these features, the model can address the analysis of the epidemic and the efficacy of non-pharmaceutical interventions, as well as possible vaccination strategies and the introduction of the Green Pass, a containment measure introduced in Italy in 2021. By leveraging the SEIHRDV model, we successfully analyzed epidemic trends during the COVID-19 outbreak from September 2020 to July 2021. The model proved instrumental in conducting comprehensive what-if studies and scenario analyses tailored to Italy and its regions. Furthermore, SEIHRDV facilitated accurate forecasting of the potential future trajectory of the epidemic, providing critical information for informed decision making and public health strategies.
中文: 我们提出了一种新颖的SEIHRDV流行病学模型,通过意大利数据验证,该模型结合年龄分层和多种暴露环境,用于分析COVID-19趋势、干预措施效果及疫苗接种策略。
English: We introduce a novel SEIHRDV epidemiological model, validated with Italian data, which incorporates age stratification and multiple exposure contexts to analyze COVID-19 trends, intervention efficacy, and vaccination strategies.

Authors:Zhangqian Bi, Yao Wan, Zhaoyang Chu, Yufei Hu, Junyi Zhang, Hongyu Zhang, Guandong Xu, Hai Jin
Title: How to Select Pre-Trained Code Models for Reuse? A Learning Perspective
Abstract:
Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pre-training, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, training data, or brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or incur high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and measure the distribution deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared to 2,700 hours with brute-force fine-tuning, with less than 6% performance degradation across related tasks.
中文: 预训练和微调语言模型对代码任务有效,但选择合适的模型具有挑战性,因此本研究提出高效的学习型方法,大幅缩短选择时间且性能损失极小。
English: Pre-training and fine-tuning language models is effective for code tasks, but selecting the right model is challenging, so this study introduces efficient learning-based methods that drastically cut selection time with minimal performance loss.

Authors:Chao Feng, Yuanzhe Gao, Alberto Huertas Celdran, Gerome Bovet, Burkhard Stiller
Title: From Models to Network Topologies: A Topology Inference Attack in Decentralized Federated Learning
Abstract:
Federated Learning (FL) is widely recognized as a privacy-preserving Machine Learning paradigm due to its model-sharing mechanism that avoids direct data exchange. Nevertheless, model training leaves exploitable traces that can be used to infer sensitive information. In Decentralized FL (DFL), the topology, defining how participants are connected, plays a crucial role in shaping the model's privacy, robustness, and convergence. However, the topology introduces an unexplored vulnerability: attackers can exploit it to infer participant relationships and launch targeted attacks. This work uncovers the hidden risks of DFL topologies by proposing a novel Topology Inference Attack that infers the topology solely from model behavior. A taxonomy of topology inference attacks is introduced, categorizing them by the attacker's capabilities and knowledge. Practical attack strategies are designed for various scenarios, and experiments are conducted to identify key factors influencing attack success. The results demonstrate that analyzing only the model of each node can accurately infer the DFL topology, highlighting a critical privacy risk in DFL systems. These findings offer insights for improving privacy preservation in DFL environments.
中文: 本研究揭示去中心化联邦学习的拓扑结构存在新型拓扑推断攻击风险,仅通过模型行为即可准确推断参与者连接关系,凸显了严重的隐私泄露隐患。
English: This study reveals that decentralized federated learning topologies are vulnerable to a novel Topology Inference Attack, which can accurately deduce participant connections solely from model behaviors, exposing significant privacy risks.

Authors:Takashi Harada, Takehiro Motomitsu, Katsuhiko Hayashi, Yusuke Sakai, Hidetaka Kamigaito
Title: Can Impressions of Music be Extracted from Thumbnail Images?
Abstract:
In recent years, there has been a notable increase in research on machine learning models for music retrieval and generation systems that are capable of taking natural language sentences as inputs. However, there is a scarcity of large-scale publicly available datasets consisting of music data and their corresponding natural language descriptions, known as music captions. In particular, non-musical information such as suitable situations for listening to a track and the emotions elicited upon listening is crucial for describing music. This type of information is underrepresented in existing music caption datasets due to the challenges associated with extracting it directly from music data. To address this issue, we propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images, and validate the effectiveness of our approach through human evaluations. Additionally, we created a dataset with approximately 360,000 captions containing non-musical aspects. Leveraging this dataset, we trained a music retrieval model and demonstrated its effectiveness in music retrieval tasks through evaluation.
Chinese: 针对音乐系统机器学习研究缺乏包含丰富非音乐信息的大规模数据集,我们提出利用音乐缩略图生成此类描述的方法,构建了约36万条标注的数据集,并通过检索任务验证了其有效性。
English: Recent machine learning research for music systems lacks large-scale datasets with rich non-musical descriptions, so we developed a method using music thumbnails to generate such captions, created a 360,000-entry dataset, and validated its effectiveness in retrieval tasks.

Authors:Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang
Title: DeTrack: In-model Latent Denoising Learning for Visual Object Tracking
Abstract:
Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods heavily depend on matching results and do not utilize positional priors, while the autoregressive approach can only be trained using bounding boxes available in the training set, potentially resulting in suboptimal performance during testing with unseen data. Inspired by the diffusion model, denoising learning enhances the model's robustness to unseen data. Therefore, we introduce noise to bounding boxes, generating noisy boxes for training and thus enhancing model robustness on testing data. We propose a new paradigm that formulates the visual object tracking problem as a denoising learning process. However, tracking algorithms are usually required to run in real time, and directly applying the diffusion model to object tracking would severely impair tracking speed. Therefore, we decompose the denoising learning process across the denoising blocks within a model, rather than running the model multiple times, and summarize the proposed paradigm as an in-model latent denoising learning process. Specifically, we propose a denoising Vision Transformer (ViT), which is composed of multiple denoising blocks. Template and search embeddings are projected into every denoising block as conditions. Each denoising block is responsible for removing part of the noise in a predicted bounding box, and multiple stacked denoising blocks cooperate to accomplish the whole denoising process. Subsequently, we utilize image features and trajectory information to refine the denoised bounding box. We also utilize trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, which achieves competitive performance on several challenging datasets.
Chinese: 本文提出了一种用于视觉目标跟踪的去噪学习范式,通过在训练中引入噪声增强模型对未知数据的鲁棒性,并在视觉Transformer模型内部实现去噪过程分解,在保持实时性的同时取得了优越性能。
English: This paper introduces a denoising learning paradigm for visual object tracking that enhances robustness to unseen data by incorporating noise during training and decomposes the denoising process within a Vision Transformer model, achieving competitive performance while maintaining real-time efficiency.
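
The idea of training on noisy bounding boxes can be illustrated with a minimal sketch. The noise model below (Gaussian jitter proportional to box size) and the names `add_box_noise` and `make_training_pair` are assumptions made for illustration; the abstract does not specify the noise distribution or schedule used in DeTrack.

```python
import random

def add_box_noise(box, scale=0.1, rng=None):
    """Return a noisy copy of a (cx, cy, w, h) box.

    Hypothetical noise model: Gaussian jitter proportional to box size.
    The actual noise used in the paper is not given in the abstract.
    """
    rng = rng or random.Random()
    cx, cy, w, h = box
    noisy_cx = cx + rng.gauss(0.0, scale * w)
    noisy_cy = cy + rng.gauss(0.0, scale * h)
    # Keep width and height strictly positive after jitter.
    noisy_w = max(1e-6, w * (1.0 + rng.gauss(0.0, scale)))
    noisy_h = max(1e-6, h * (1.0 + rng.gauss(0.0, scale)))
    return (noisy_cx, noisy_cy, noisy_w, noisy_h)

def make_training_pair(gt_box, scale=0.1, rng=None):
    """Pair a noisy input box with the clean target a denoiser should recover."""
    return add_box_noise(gt_box, scale, rng), gt_box
```

A denoising model would take the noisy box as input, conditioned on image features, and be supervised with the clean box as target.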

Authors:Zhongwei Wang, Tong Wu, Zhiyong Chen, Liang Qian, Yin Xu, Meixia Tao
Title: Diffusion Model-Based Data Synthesis Aided Federated Semi-Supervised Learning
Abstract:
Federated semi-supervised learning (FSSL) is primarily challenged by two factors: the scarcity of labeled data across clients and the non-independent and identically distributed (non-IID) nature of data among clients. In this paper, we propose a novel approach, diffusion model-based data synthesis aided FSSL (DDSA-FSSL), which utilizes a diffusion model (DM) to generate synthetic data, bridging the gap between heterogeneous local data distributions and the global data distribution. In DDSA-FSSL, clients address the scarcity of labeled data by employing a federated learning-trained classifier to perform pseudo labeling for unlabeled data. The DM is then collaboratively trained using both labeled and precision-optimized pseudo-labeled data, enabling clients to generate synthetic samples for classes that are absent in their labeled datasets. This process allows clients to generate more comprehensive synthetic datasets aligned with the global distribution. Extensive experiments conducted on multiple datasets and varying non-IID distributions demonstrate the effectiveness of DDSA-FSSL, e.g., it improves accuracy from 38.46% to 52.14% on the CIFAR-10 dataset with 10% labeled data.
中文: 提出的DDSA-FSSL方法通过扩散模型生成合成数据来弥合本地与全局数据分布之间的差异,有效解决了联邦半监督学习中的标记数据稀缺和非独立同分布问题,在多个基准测试中显著提升了准确率。
English: The proposed DDSA-FSSL method addresses challenges in federated semi-supervised learning by using diffusion models to generate synthetic data that bridges local and global distribution gaps, significantly improving accuracy as demonstrated on benchmark datasets.

Authors:Jiajie Liu, Mengyuan Liu, Hong Liu, Wenhao Li
Title: TCPFormer: Learning Temporal Correlation with Implicit Pose Proxy for 3D Human Pose Estimation
Abstract:
Recent multi-frame lifting methods have dominated 3D human pose estimation. However, previous methods ignore the intricate dependence within the 2D pose sequence and learn only a single temporal correlation. To alleviate this limitation, we propose TCPFormer, which leverages an implicit pose proxy as an intermediate representation. Each proxy within the implicit pose proxy can build one temporal correlation, thereby helping us learn more comprehensive temporal correlations of human motion. Specifically, our method consists of three key components: Proxy Update Module (PUM), Proxy Invocation Module (PIM), and Proxy Attention Module (PAM). PUM first uses pose features to update the implicit pose proxy, enabling it to store representative information from the pose sequence. PIM then invokes and integrates the pose proxy with the pose sequence to enhance the motion semantics of each pose. Finally, PAM leverages the above mapping between the pose sequence and pose proxy to enhance the temporal correlation of the whole pose sequence. Experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our proposed TCPFormer outperforms the previous state-of-the-art methods.
中文: TCPFormer通过隐式姿态代理学习2D姿态序列中的多重时序相关性,其三个核心模块协同提升运动语义表征,在Human3.6M和MPI-INF-3DHP数据集上实现了最优性能。
English: TCPFormer introduces an implicit pose proxy to capture multiple temporal correlations in 2D pose sequences, outperforming state-of-the-art methods on benchmark datasets through its three-module architecture for enhanced motion representation.

Authors:Chulun Zhou, Qiujing Wang, Mo Yu, Xiaoqian Yue, Rui Lu, Jiangnan Li, Yifan Zhou, Shunchi Zhang, Jie Zhou, Wai Lam
Title: The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters
Abstract:
Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others' thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM heavily relies on an understanding of the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines' ToM capabilities, because they use short narratives without global context, especially the personal background of characters. In this paper, we verify the importance of comprehensive contextual understanding of personal backgrounds in ToM and assess the performance of LLMs in such complex scenarios. To achieve this, we introduce the CharToM benchmark, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels than when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 and DeepSeek-R1 models, show that LLMs still perform notably worse than humans, even though they have seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.
中文摘要:本文通过CharToM基准验证了人物背景知识对心理理论能力的重要性,研究表明人类和大型语言模型在掌握完整背景信息时表现更优,但当前LLMs在理解复杂情境信息方面仍远不及人类水平。
English Summary: This paper introduces the CharToM benchmark to evaluate Theory-of-Mind capabilities, revealing that both humans and large language models perform significantly better with comprehensive character background knowledge, yet LLMs still lag behind humans in grasping nuanced contextual understanding.

Authors:Cunhang Fan, Youdian Gao, Zexu Pan, Jingjing Zhang, Hongyu Zhang, Jie Zhang, Zhao Lv
Title: Improved Feature Extraction Network for Neuro-Oriented Target Speaker Extraction
Abstract:
The recent rapid development of auditory attention decoding (AAD) offers the possibility of using electroencephalography (EEG) as auxiliary information for target speaker extraction. However, effectively modeling long sequences of speech and resolving the identity of the target speaker from EEG signals remains a major challenge. In this paper, an improved feature extraction network (IFENet) is proposed for neuro-oriented target speaker extraction, which mainly consists of a speech encoder with dual-path Mamba and an EEG encoder with Kolmogorov-Arnold Networks (KAN). We propose SpeechBiMamba, which makes use of dual-path Mamba in modeling local and global speech sequences to extract speech features. In addition, we propose EEGKAN to effectively extract EEG features that are closely related to the auditory stimuli and locate the target speaker through the subject's attention information. Experiments on the KUL and AVED datasets show that IFENet outperforms the state-of-the-art model, achieving 36\% and 29\% relative improvements in terms of scale-invariant signal-to-distortion ratio (SI-SDR) under an open evaluation condition.
Chinese: 听觉注意解码的最新进展使脑电图能够辅助目标说话人提取,但在建模长语音序列和从脑电信号中识别说话人方面仍存在挑战,因此提出了IFENet,它整合了SpeechBiMamba和EEGKAN,在基准数据集上实现了显著的性能提升。
English: Recent advances in auditory attention decoding enable EEG to assist in target speaker extraction, yet challenges persist in modeling long speech sequences and identifying speakers from EEG signals, leading to the development of IFENet, which integrates SpeechBiMamba and EEGKAN to achieve significant performance improvements on benchmark datasets.

Authors:Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, Michael Lyu
Title: The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation
Abstract:
Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, existing work primarily relies on human-written plain prompts, which often leads to suboptimal results since the performance of LLMs can be highly influenced by the prompts. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs might be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although methods for automated prompt optimization exist in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, these methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, limiting LLMs' performance in the task.
中文摘要:大型语言模型虽能生成软件测试用例,但现有方法依赖通用提示导致效果不佳,且自动优化面临提示缺乏多样性和领域知识的双重困境。
English Summary: Large Language Models can generate software test cases, but their effectiveness is hindered by suboptimal prompts that lack diversity and domain knowledge, making automated prompt optimization a key challenge.

Authors:Amir M. Ebrahimi, Bram Adams, Gustavo A. Oliva, Ahmed E. Hassan
Title: A Large-Scale Exploratory Study on the Proxy Pattern in Ethereum
Abstract:
The proxy pattern is a well-known design pattern with numerous use cases in several sectors of the software industry. As such, the use of the proxy pattern is also a common approach in the development of complex decentralized applications (DApps) on the Ethereum blockchain. Despite the importance of proxy contracts, little is known about (i) how their prevalence changed over time, (ii) the ways in which developers integrate proxies in the design of DApps, and (iii) what proxy types are being most commonly leveraged by developers. This study bridges these gaps through a comprehensive analysis of Ethereum smart contracts, utilizing a dataset of 50 million contracts and 1.6 billion transactions as of September 2022. Our findings reveal that 14.2% of all deployed smart contracts are proxy contracts. We show that proxy contracts are being more actively used than non-proxy contracts. Also, the usage of proxy contracts in various contexts, transactions involving proxy contracts, and adoption of proxy contracts by users have shown an upward trend over time, peaking at the end of our study period. They are deployed either through off-chain scripts or on-chain factory contracts, which are employed in 39.1% and 60.9% of identified usage contexts, respectively. We found that while the majority (67.8%) of proxies act as an interceptor, 32.2% enable upgradeability. Proxy contracts are typically (79%) implemented based on known reference implementations, with 29.4% being of type ERC-1167, a class of proxies that aims to cheaply reuse and clone contracts' functionality. Our evaluation shows that our proposed behavioral proxy detection method has a precision and recall of 100% in detecting active proxies. Finally, we derive a set of practical recommendations for developers and introduce open research questions to guide future research on the topic.
中文: 本研究通过对以太坊智能合约的分析发现,代理合约占部署总量的14.2%,其使用活跃度和采用率呈上升趋势,其中多数用作拦截器而三分之一支持可升级功能,且主要基于已知参考实现进行部署。
English: This study analyzes Ethereum smart contracts to reveal that proxy contracts constitute 14.2% of deployments, showing increased activity and adoption trends, with most serving as interceptors while a third enable upgradeability, primarily implemented through known reference implementations.

Authors:Amir M. Ebrahimi, Bram Adams, Gustavo A. Oliva, Ahmed E. Hassan
Title: UPC Sentinel: An Accurate Approach for Detecting Upgradeability Proxy Contracts in Ethereum
Abstract:
Software applications that run on a blockchain platform are known as DApps. DApps are built using smart contracts, which are immutable after deployment. Just like any real-world software system, DApps need to receive new features and bug fixes over time in order to remain useful and secure. However, Ethereum lacks native solutions for post-deployment smart contract maintenance, requiring developers to devise their own methods. A popular method is known as the upgradeability proxy contract (UPC), which involves implementing the proxy design pattern (as defined by the Gang of Four). In this method, client calls first hit a proxy contract, which then delegates calls to a certain implementation contract. Most importantly, the proxy contract can be reconfigured during runtime to delegate calls to another implementation contract, effectively enabling application upgrades. For researchers, the accurate detection of UPCs is a strong requirement in the understanding of how exactly real-world DApps are maintained over time. For practitioners, the accurate detection of UPCs is crucial for providing application behavior transparency and enabling auditing. In this paper, we introduce UPC Sentinel, a novel three-layer algorithm that utilizes both static and dynamic analysis of smart contract bytecode to accurately detect active UPCs. We evaluated UPC Sentinel using two distinct ground truth datasets. In the first dataset, our method demonstrated a near-perfect accuracy of 99%. The evaluation on the second dataset further established our method's efficacy, showing a perfect precision rate of 100% and a near-perfect recall of 99.3%, outperforming the state of the art. Finally, we discuss the potential value of UPC Sentinel in advancing future research efforts.
中文: 本文提出UPC Sentinel这一新颖的三层算法,通过静态和动态分析准确检测以太坊去中心化应用中的可升级代理合约,实现了近乎完美的准确率并超越了现有技术水平。
English: This paper introduces UPC Sentinel, a novel three-layer algorithm that accurately detects upgradeability proxy contracts in Ethereum DApps using static and dynamic analysis, achieving near-perfect accuracy and outperforming existing methods.

Authors:Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang
Title: Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free
Abstract:
Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamic tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivered a 9$\times$ speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observed a 7$\times$ acceleration coupled with a notable FID reduction of 2.17.
中文: 本文提出了一种动态感知令牌剪枝(DaTo)方法,通过选择性剪除低动态令牌并结合特征缓存与令牌剪枝,在无需训练的情况下显著提升了Stable Diffusion的生成速度并改善了图像质量。
English: This paper introduces a dynamics-aware token pruning (DaTo) method that enhances Stable Diffusion by selectively pruning low-dynamic tokens and combining feature caching with token pruning, achieving significant speed improvements and better image quality without training.
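
The notion of keeping only high-dynamic tokens can be sketched as follows. This is an illustrative assumption, not the DaTo algorithm itself: here "dynamics" is taken to be the L2 change of a token's feature vector between two adjacent timesteps, and the names `token_dynamics`, `select_dynamic_tokens`, and `keep_ratio` are hypothetical.

```python
import math

def token_dynamics(prev_feats, curr_feats):
    """L2 change of each token's feature vector between two timesteps."""
    return [math.dist(p, c) for p, c in zip(prev_feats, curr_feats)]

def select_dynamic_tokens(prev_feats, curr_feats, keep_ratio=0.5):
    """Return sorted indices of the tokens with the largest feature change.

    Tokens outside this set would be pruned from self-attention and
    served from cached features instead (assumed behaviour).
    """
    scores = token_dynamics(prev_feats, curr_feats)
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

With four tokens and `keep_ratio=0.5`, only the two tokens whose features moved the most would be recomputed.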

Authors:Haoran Wang, Pingzhi Li, Min Chen, Jinglei Cheng, Junyu Liu, Tianlong Chen
Title: GroverGPT: A Large Language Model with 8 Billion Parameters for Quantum Searching
Abstract:
Quantum computing is an exciting non-Von Neumann paradigm, offering provable speedups over classical computing for specific problems. However, the practical limits of classical simulatability for quantum circuits remain unclear, especially with current noisy quantum devices. In this work, we explore the potential of leveraging Large Language Models (LLMs) to simulate the output of a quantum Turing machine using Grover's quantum circuits, known to provide quadratic speedups over classical counterparts. To this end, we developed GroverGPT, a specialized model based on LLaMA's 8-billion-parameter architecture, trained on over 15 trillion tokens. Unlike brute-force state-vector simulations, which demand substantial computational resources, GroverGPT employs pattern recognition to approximate quantum search algorithms without explicitly representing quantum states. Analyzing 97K quantum search instances, GroverGPT consistently outperformed OpenAI's GPT-4o (45\% accuracy), achieving nearly 100\% accuracy on 6- and 10-qubit datasets when trained on 4-qubit or larger datasets. It also demonstrated strong generalization, surpassing 95\% accuracy for systems with over 20 qubits when trained on 3- to 6-qubit data. Analysis indicates GroverGPT captures quantum features of Grover's search rather than classical patterns, supported by novel prompting strategies to enhance performance. Although accuracy declines with increasing system size, these findings offer insights into the practical boundaries of classical simulatability. This work suggests task-specific LLMs can surpass general-purpose models like GPT-4o in quantum algorithm learning and serve as powerful tools for advancing quantum research.
中文摘要:本研究开发的GroverGPT专用大语言模型通过量子模式识别而非显式状态表征来模拟量子搜索算法,在保持高精度的同时显著超越GPT-4o等通用模型,既揭示了经典模拟的实践边界,也为量子研究提供了新型工具。
English Summary: This study introduces GroverGPT, a specialized large language model that simulates quantum search algorithms with high accuracy by recognizing quantum patterns rather than explicit state representation, demonstrating superior performance over general models like GPT-4o while revealing practical limits of classical quantum simulation.
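
For context, the brute-force state-vector simulation that the abstract contrasts with GroverGPT can be sketched in a few lines. This is a textbook simulation of Grover's search for a single marked item, not code from the paper; the function name `grover_success_probability` is an assumption.

```python
import math

def grover_success_probability(n_qubits, marked, iterations=None):
    """Simulate Grover's search with a real-valued state vector.

    Returns (probability of measuring `marked`, number of iterations used).
    Memory grows as 2**n_qubits, which is what makes this approach costly.
    """
    size = 2 ** n_qubits
    if iterations is None:
        # Standard choice: about (pi / 4) * sqrt(N) iterations.
        iterations = math.floor(math.pi / 4 * math.sqrt(size))
    amps = [1.0 / math.sqrt(size)] * size  # uniform superposition
    for _ in range(iterations):
        amps[marked] = -amps[marked]            # oracle: flip the marked amplitude
        mean = sum(amps) / size
        amps = [2.0 * mean - a for a in amps]   # diffusion: inversion about the mean
    return amps[marked] ** 2, iterations
```

For 4 qubits (16 items), three iterations already push the success probability above 0.95, which reflects the quadratic speedup over checking items one by one.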

Authors:Xianglin Yang, Gelei Deng, Jieming Shi, Tianwei Zhang, Jin Song Dong
Title: Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
Abstract:
Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats, which could lead to the generation of inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs still vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced \textit{reasoning capabilities} of LLMs for proactive assessment of harmful inputs, rather than simply blocking them. SCoT augments any refusal training datasets to critically analyze the intent behind each request before generating answers. By employing proactive reasoning, SCoT enhances the generalization of LLMs across varied harmful queries and scenarios not covered in the safety alignment corpus. Additionally, it generates detailed refusals specifying the rules violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution issues and adversarial manipulations while maintaining strong general capabilities.
中文: 提出的安全思维链(SCoT)防御策略利用大语言模型的推理能力主动评估有害输入,通过减少漏洞并生成详细拒绝理由,显著优于现有方法,同时保持模型的通用能力。
English: The proposed Safety Chain-of-Thought (SCoT) defense leverages LLMs' reasoning abilities to proactively assess harmful inputs, significantly outperforming existing methods by reducing vulnerabilities and generating detailed refusals while preserving general capabilities.

Authors:Lynnette Hui Xian Ng, Kathleen M. Carley
Title: Social Cyber Geographical Worldwide Inventory of Bots
Abstract:
Social Cyber Geography is the space in the digital cyber realm that is produced through social relations. Communication in the social media ecosystem happens not only because of human interactions, but is also fueled by algorithmically controlled bot agents. Most studies have not looked at the social cyber geography of bots because they focus on bot activity within a single country. Since creating a bot uses universal programming technology, how prevalent are these bots throughout the world? To quantify bot activity worldwide, we perform a multilingual and geospatial analysis on a large dataset of social data collected from X during the Coronavirus pandemic in 2021. This pandemic affected most of the world, and thus is a common topic of discussion. Our dataset consists of ~100 million posts generated by ~31 million users. Most bot studies focus only on English-speaking countries, because most bot detection algorithms are built for the English language. However, only 47\% of the bots write in the English language. To accommodate multiple languages, we built Multilingual BotBuster, a multi-language bot detection algorithm, to identify the bots in this diverse dataset. We also create a Geographical Location Identifier to swiftly identify the countries a user affiliates with in their description. Our results show that bots can appear to move from one country to another, but the language they write in remains relatively constant. Bots distribute narratives on distinct topics related to their self-declared country affiliation. Finally, despite the diverse distribution of bot locations around the world, the proportion of bots per country is about 20%. Our work stresses the importance of a united analysis of the cyber and physical realms, where we combine both spheres to inventory the language and location of social media bots and understand communication strategies.
中文: 本研究通过对2021年疫情期间社交媒体数据的多语言和地理空间分析发现,机器人账户在保持语言稳定性的同时会呈现跨国流动特征,并传播与其宣称所属国家相关的特定叙事,各国机器人账户比例均维持在20%左右。
English: This study introduces a multilingual and geospatial analysis of social media bots using a dataset from the 2021 pandemic, revealing that bots maintain consistent language use while appearing to shift locations and distribute country-specific narratives, with a global average of 20% bot presence per country.

Authors:Mrinank Sharma, Meg Tong, Jesse Mu, Jerry Wei, Jorrit Kruthoff, Scott Goodfriend, Euan Ong, Alwin Peng, Raj Agarwal, Cem Anil, Amanda Askell, Nathan Bailey, Joe Benton, Emma Bluemke, Samuel R. Bowman, Eric Christiansen, Hoagy Cunningham, Andy Dau, Anjali Gopal, Rob Gilson, Logan Graham, Logan Howard, Nimit Kalra, Taesung Lee, Kevin Lin, Peter Lofgren, Francesco Mosconi, Clare O'Hara, Catherine Olsson, Linda Petrini, Samir Rajani, Nikhil Saxena, Alex Silverstein, Tanya Singh, Theodore Sumers, Leonard Tang, Kevin K. Troy, Constantin Weisser, Ruiqi Zhong, Giulio Zhou, Jan Leike, Jared Kaplan, Ethan Perez
Title: Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Abstract:
Large language models (LLMs) are vulnerable to universal jailbreaks: prompting strategies that systematically bypass model safeguards and enable users to carry out harmful processes that require many model interactions, like manufacturing illegal substances at scale. To defend against these attacks, we introduce Constitutional Classifiers: safeguards trained on synthetic data, generated by prompting LLMs with natural language rules (i.e., a constitution) specifying permitted and restricted content. In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable.
中文摘要:大型语言模型易受系统性越狱攻击,但基于合成数据训练的宪法分类器能有效防御此类攻击,在保持部署可行性的同时仅产生微小性能影响。
English Summary: Large language models are susceptible to universal jailbreaks that bypass safeguards, but Constitutional Classifiers trained on synthetic data provide robust defense while maintaining practical deployment viability with minimal performance impact.

Authors:Xuejian Zhang, Ruisi He, Mi Yang, Zhengyu Zhang, Ziyi Qi, Bo Ai
Title: Vision Aided Channel Prediction for Vehicular Communications: A Case Study of Received Power Prediction Using RGB Images
Abstract:
The communication scenarios and channel characteristics of 6G will be more complex and difficult to characterize. Conventional methods for channel prediction face challenges in achieving an optimal balance between accuracy, practicality, and generalizability. Additionally, they often fail to effectively leverage environmental features. With the integration of communication and artificial intelligence being a pivotal development vision for 6G, it is imperative to achieve intelligent prediction of channel characteristics. Vision-aided methods have been employed in various wireless communication tasks, excluding channel prediction, and have demonstrated enhanced efficiency and performance. In this paper, we propose a vision-aided two-stage model for channel prediction in millimeter wave vehicular communication scenarios, realizing accurate received power prediction utilizing solely RGB images. Firstly, we obtain original images of the propagation environment through an RGB camera. Secondly, three typical computer vision methods, namely object detection, instance segmentation, and binary mask, are employed for environmental information extraction from the original images in stage 1, and prediction of received power based on the processed images is implemented in stage 2. Pre-trained YOLOv8 and ResNets are used in stages 1 and 2, respectively, and fine-tuned on datasets. Finally, we conduct five experiments to evaluate the performance of the proposed model, demonstrating its feasibility, accuracy, and generalization capabilities. The model proposed in this paper offers novel solutions for achieving intelligent channel prediction in vehicular communications.
中文: 本文提出了一种视觉辅助的双阶段模型,利用RGB图像实现6G车联网场景中的精确信道预测,通过计算机视觉方法展现了优越的准确性和泛化能力。
English: This paper introduces a vision-aided two-stage model that uses RGB images for accurate channel prediction in 6G vehicular communications, demonstrating improved accuracy and generalization through computer vision techniques.

Authors:Ruofan Liang, Zan Gojcic, Huan Ling, Jacob Munkberg, Jon Hasselgren, Zhi-Hao Lin, Jun Gao, Alexander Keller, Nandita Vijaykumar, Sanja Fidler, Zian Wang
Title: DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models
Abstract:
Understanding and modeling lighting effects are fundamental tasks in computer vision and graphics. Classic physically-based rendering (PBR) accurately simulates the light transport, but relies on precise scene representations--explicit 3D geometry, high-quality material properties, and lighting conditions--that are often impractical to obtain in real-world scenarios. Therefore, we introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering within a holistic framework. Leveraging powerful video diffusion model priors, the inverse rendering model accurately estimates G-buffers from real-world videos, providing an interface for image editing tasks, and training data for the rendering model. Conversely, our rendering model generates photorealistic images from G-buffers without explicit light transport simulation. Experiments demonstrate that DiffusionRenderer effectively approximates inverse and forward rendering, consistently outperforming the state-of-the-art. Our model enables practical applications from a single video input--including relighting, material editing, and realistic object insertion.
中文:DiffusionRenderer提出了一种神经框架,可同时从视频中逆向估算场景属性并正向生成逼真图像,其性能超越现有方法,能够基于单一视频输入实现多种实际编辑应用。
English: DiffusionRenderer introduces a neural framework that simultaneously handles inverse rendering to estimate scene properties from videos and forward rendering to generate photorealistic images, outperforming existing methods and enabling practical editing applications from single video inputs.

Authors:Placido Falqueto, Alberto Sanfeliu, Luigi Palopoli, Daniele Fontanelli
Title: Learning Priors of Human Motion With Vision Transformers
Abstract:
A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop is very important for applications such as mobility studies in urban areas or robot navigation tasks within human-populated environments. In this article, we propose a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). We describe the methodology and the proposed neural architecture and report experimental results on a standard dataset. We show that the proposed ViT architecture improves the metrics compared to a method based on a CNN.
Chinese: 本文提出了一种基于视觉变换器的神经网络架构,能有效捕捉空间相关性以分析人类移动模式,实验证明其在性能上优于基于卷积神经网络的方法。
English: This article introduces a Vision Transformer-based neural architecture that effectively captures spatial correlations to analyze human movement patterns, demonstrating superior performance over CNN-based methods in experiments.

Authors:Matteo Dalle Vedove, Matteo Bonetto, Edoardo Lamon, Luigi Palopoli, Matteo Saveriano, Daniele Fontanelli
Title: Surface Defect Identification using Bayesian Filtering on a 3D Mesh
Abstract:
This paper presents a CAD-based approach for automated surface defect detection. We leverage the a-priori knowledge embedded in a CAD model and integrate it with point cloud data acquired from commercially available stereo and depth cameras. The proposed method first transforms the CAD model into a high-density polygonal mesh, where each vertex represents a state variable in 3D space. Subsequently, a weighted least squares algorithm is employed to iteratively estimate the state of the scanned workpiece based on the captured point cloud measurements. This framework offers the potential to incorporate information from diverse sensors into the CAD domain, facilitating a more comprehensive analysis. Preliminary results demonstrate promising performance, with the algorithm achieving convergence to a sub-millimeter standard deviation in the region of interest using only approximately 50 point cloud samples. This highlights the potential of utilising commercially available stereo cameras for high-precision quality control applications.
Chinese: 本文提出了一种基于CAD的方法,通过将CAD模型数据与点云测量相结合来自动检测表面缺陷,利用立体相机以少量样本实现亚毫米级精度。
English: This paper introduces a CAD-based method that integrates CAD model data with point cloud measurements to automatically detect surface defects, achieving sub-millimeter precision with minimal samples using stereo cameras.
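
The iterative weighted least squares estimate of a per-vertex state can be illustrated with a scalar sketch. It assumes, hypothetically, that each mesh vertex carries a single offset value with a variance and that each point cloud sample contributes a noisy measurement of that offset; the paper's actual state definition and measurement model are not given in the abstract, and `fuse_measurement` is an illustrative name.

```python
def fuse_measurement(estimate, variance, measurement, meas_variance):
    """One inverse-variance weighted update of a scalar vertex offset.

    This is the recursive form of weighted least squares for a constant
    state; each call folds one new measurement into the estimate.
    """
    gain = variance / (variance + meas_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance
```

Starting from unit prior variance and fusing 50 measurements of unit variance shrinks the variance to 1/51, which shows how a few dozen samples can tighten the estimate considerably under these assumptions.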

Authors:Vilém Zouhar, Peng Cui, Mrinmaya Sachan
Title: How to Select Datapoints for Efficient Human Evaluation of NLG Models?
Abstract:
Human evaluation is the gold standard for evaluating text generation models. However, it is expensive. In order to fit budgetary constraints, a random subset of the test data is often chosen in practice for human evaluation. However, randomly selected data may not accurately represent test performance, making this approach economically inefficient for model comparison. Thus, in this work, we develop and analyze a suite of selectors to get the most informative datapoints for human evaluation, taking the evaluation costs into account. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. We further develop an approach to distill these selectors to the scenario where the model outputs are not yet available. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts. We demonstrate the efficacy of our selectors in two common NLG tasks, machine translation and summarization, and show that only $\sim$70\% of the test data is needed to produce the same evaluation result as the entire data.
Chinese: 人工评估是评估文本生成模型的金标准,但成本高昂,因此开发了选择器来识别最具信息量的数据点,可将所需测试数据减少约30%,同时保持评估准确性。
English: Human evaluation is the gold standard for assessing text generation models but is costly, leading to the development of selectors that identify the most informative data points, reducing the required test data by about 30% while maintaining evaluation accuracy.
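
One of the selector families named in the abstract, variance in automated metric scores, can be sketched as follows. This is an illustrative simplification rather than the paper's method: it ignores evaluation cost, and the function name `select_by_variance`, the item identifiers, and the scores are all made up for the example.

```python
from statistics import pvariance


def select_by_variance(metric_scores, budget):
    """Return the `budget` items whose metric scores vary most across models.

    metric_scores maps an item id to a list of automated metric scores,
    one score per candidate model.
    """
    ranked = sorted(
        metric_scores,
        key=lambda item: pvariance(metric_scores[item]),
        reverse=True,  # highest disagreement between models first
    )
    return ranked[:budget]


# Hypothetical scores for four source items and three models.
scores = {
    "src-1": [0.80, 0.81, 0.79],
    "src-2": [0.20, 0.90, 0.55],
    "src-3": [0.60, 0.40, 0.50],
    "src-4": [0.70, 0.70, 0.70],
}

chosen = select_by_variance(scores, budget=2)
print(chosen)  # ['src-2', 'src-3']
```

The intuition is that items on which all models score alike (such as `src-4`) say little about which model is better, so human annotation effort is spent on the items where the models disagree.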

Authors:Swarnadeep Saha, Xian Li, Marjan Ghazvininejad, Jason Weston, Tianlu Wang
Title: Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Abstract:
LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer, synthetically generated preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
中文:EvalPlanner提出了一种偏好优化算法,将评估计划生成与执行分离,通过在合成数据上的自训练迭代优化,在RewardBench上创下性能新高,并在多个基准测试中展现出更强的鲁棒性。
English: EvalPlanner introduces a preference optimization algorithm that separates planning from reasoning execution, achieving state-of-the-art performance on RewardBench and demonstrating enhanced robustness across multiple benchmarks through synthetic self-training.

Authors:Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, Hoda Heidari, Anson Ho, Sayash Kapoor, Leila Khalatbari, Shayne Longpre, Sam Manning, Vasilios Mavroudis, Mantas Mazeika, Julian Michael, Jessica Newman, Kwan Yee Ng, Chinasa T. Okolo, Deborah Raji, Girish Sastry, Elizabeth Seger, Theodora Skeadas, Tobin South, Emma Strubell, Florian Tramèr, Lucia Velasco, Nicole Wheeler, Daron Acemoglu, Olubayo Adekanmbi, David Dalrymple, Thomas G. Dietterich, Edward W. Felten, Pascale Fung, Pierre-Olivier Gourinchas, Fredrik Heintz, Geoffrey Hinton, Nick Jennings, Andreas Krause, Susan Leavy, Percy Liang, Teresa Ludermir, Vidushi Marda, Helen Margetts, John McDermid, Jane Munga, Arvind Narayanan, Alondra Nelson, Clara Neppel, Alice Oh, Gopal Ramchurn, Stuart Russell, Marietje Schaake, Bernhard Schölkopf, Dawn Song, Alvaro Soto, Lee Tiedrich, Gaël Varoquaux, Andrew Yao, Ya-Qin Zhang, Fahad Albalawi, Marwan Alserkal, Olubunmi Ajala, Guillaume Avrin, Christian Busch, André Carlos Ponce de Leon Ferreira de Carvalho, Bronwyn Fox, Amandeep Singh Gill, Ahmet Halit Hatip, Juha Heikkilä, Gill Jolly, Ziv Katzir, Hiroaki Kitano, Antonio Krüger, Chris Johnson, Saif M. Khan, Kyoung Mu Lee, Dominic Vincent Ligot, Oleksii Molchanovskyi, Andrea Monti, Nusu Mwamanzi, Mona Nemer, Nuria Oliver, José Ramón López Portillo, Balaraman Ravindran, Raquel Pezoa Rivera, Hammam Riza, Crystal Rugege, Ciarán Seoighe, Jerry Sheehan, Haroon Sheikh, Denise Wong, Yi Zeng
Title: International AI Safety Report
Abstract:
The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, representing diverse perspectives and disciplines. Led by the report's Chair, these independent experts collectively had full discretion over the report's content.
中文: 首份《国际人工智能安全报告》受人工智能安全峰会参与国委托,汇集了来自30个国家及国际组织的百名独立专家意见,系统整合了当前关于先进人工智能系统能力、风险与安全性的研究成果。
English: The inaugural International AI Safety Report, commissioned by nations from the AI Safety Summit, consolidates current findings on advanced AI systems' capabilities, risks, and safety through contributions from 100 independent experts across 30 countries and international organizations.

Authors:Yubo Wang, Xiang Yue, Wenhu Chen
Title: Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Abstract:
Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we propose Critique Fine-Tuning (CFT), a method more effective than SFT for reasoning tasks. Instead of simply imitating correct responses, CFT trains models to critique noisy responses, inspired by human learning processes that emphasize critical thinking, deeper analysis, and nuanced understanding - traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct multiple critique datasets (e.g., WebInstruct, MetaMath, NuminaMath), where GPT-4o serves as the teacher to generate critiques in the form of ([query; noisy response], critique). Experiments on these datasets demonstrate that CFT consistently outperforms SFT by 4-10% across six mathematical reasoning benchmarks, and is effective across different base models including Qwen2.5, Qwen2.5-Math, and DeepSeek-Math. Notably, our model Qwen2.5-Math-CFT only requires 1 hour of training on 8 x H100 over the 50K examples, yet matches or outperforms strong competitors like Qwen2.5-Math-Instruct on most benchmarks, which use over 2M samples. Moreover, it matches the performance of SimpleRL, which is a DeepSeek-r1 replication trained with 140 x more compute. Experiments on IF_Eval and MT-Bench further demonstrate that CFT can significantly enhance the model's general generation and instruction-following capabilities, outperforming the Qwen2.5-Math-Instruct by a large margin. Ablation studies show that CFT is robust to noisy response sources and teacher critique models. These findings highlight that CFT offers a more effective alternative to advance the reasoning of language models.
Chinese: 本文提出批判性微调(CFT)方法,通过训练语言模型批判含噪回答而非简单模仿正确答案,在推理任务中持续优于监督微调4-10%,且所需训练数据和计算资源显著更少。
English: This paper introduces Critique Fine-Tuning (CFT), a method that trains language models to critique noisy responses instead of merely imitating correct ones, consistently outperforming Supervised Fine-Tuning by 4-10% on reasoning tasks while requiring significantly less training data and compute.

Authors:Xiaozhou Li, Noman Ahmad, Tomas Cerny, Andrea Janes, Valentina Lenarduzzi, Davide Taibi
Title: Toward Organizational Decoupling in Microservices Through Key Developer Allocation
Abstract:
As microservices remain popular in the software architecture domain, more practitioners and researchers have begun to pay attention to the degradation issue that diminishes their sustainability. One of the key factors that causes the degradation of the architecture is that of the software architectural structure according to Conway's law. However, the best practice of "One microservice per Team", advocated widely by the industry, is not commonly adopted, especially when many developers contribute heavily across multiple microservices and create organizational coupling. Therein, many key developers, who are responsible for the majority of the project work and are irreplaceable to the team, can also create the most coupling and be the primary cause of microservice degradation. Hence, to properly maintain microservice architecture in terms of its organizational structure, we need to identify these key developers and understand their connections to the organizational coupling within the project. We propose an approach to identify the key developers in microservice projects and investigate their connection to organizational coupling. The approach should facilitate the maintenance and optimization of microservice projects against degradation by detecting and mitigating organizational coupling.
中文摘要:微服务架构的退化常由组织耦合引起,尤其是跨多个服务的关键开发者,识别这些人员有助于维护和优化系统以应对退化问题。
English Summary: The degradation of microservice architecture is often caused by organizational coupling, particularly from key developers who work across multiple services, and identifying these individuals can help maintain and optimize the system.

Authors:Sania Waheed, Bruno Ferrarini, Michael Milford, Sarvapali D. Ramchurn, Shoaib Ehsan
Title: Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet?
Abstract:
The advances in Vision-Language models (VLMs) offer exciting opportunities for robotic applications involving image geo-localization, the problem of identifying the geo-coordinates of a place based on visual data only. Recent research works have focused on using a VLM as an embedding extractor for geo-localization; however, the most sophisticated VLMs may only be available as black boxes that are accessible through an API, and come with a number of limitations: there is no access to training data, model features and gradients; retraining is not possible; the number of predictions may be limited by the API; training on model outputs is often prohibited; and queries are open-ended. The utilization of a VLM as a stand-alone, zero-shot geo-localization system using a single text-based prompt is largely unexplored. To bridge this gap, this paper undertakes the first systematic study, to the best of our knowledge, to investigate the potential of some of the state-of-the-art VLMs as stand-alone, zero-shot geo-localization systems in a black-box setting with realistic constraints. We consider three main scenarios for this thorough investigation: a) fixed text-based prompt; b) semantically-equivalent text-based prompts; and c) semantically-equivalent query images. We also take into account the auto-regressive and probabilistic generation process of the VLMs when investigating their utility for the geo-localization task by using model consistency as a metric in addition to traditional accuracy. Our work provides new insights into the capabilities of different VLMs for the above-mentioned scenarios.
中文摘要:本文首次系统研究了在现实黑盒约束下,将先进视觉语言模型作为独立零样本地理定位系统的潜力,通过多种文本提示和查询图像场景进行评估,并引入模型一致性作为新评估指标。
English Summary: This paper conducts the first systematic study exploring state-of-the-art Vision-Language Models as standalone zero-shot geo-localization systems under black-box constraints, evaluating their performance across different prompt and image scenarios while introducing model consistency as a key metric.

Authors:Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du
Title: FBQuant: FeedBack Quantization for Large Language Models
Abstract:
Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting. To further offset the additional latency introduced by sub-branches, we develop an efficient CUDA kernel that reduces the extra inference time by 60%. Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.
中文摘要:FBQuant是一种创新的量化方法,通过反馈机制和优化的CUDA内核,在边缘设备上部署大语言模型时减少内存访问和精度损失,使3位Llama2-7B的零样本准确率提升1.2%。
English Summary: FBQuant is a novel quantization method for deploying LLMs on edge devices that reduces memory access and accuracy loss by using a feedback mechanism and an optimized CUDA kernel, improving zero-shot accuracy by 1.2% for 3-bit Llama2-7B.

Authors:Benjamin C. Carter, Jonathan Rivas Contreras, Carlos A. Llanes Villegas, Pawan Acharya, Jack Utzerath, Adonijah O. Farner, Hunter Jenkins, Dylan Johnson, Jacob Penney, Igor Steinmacher, Marco A. Gerosa, Fabio Santos
Title: SkillScope: A Tool to Predict Fine-Grained Skills Needed to Solve Issues on GitHub
Abstract:
New contributors often struggle to find tasks that they can tackle when onboarding onto a new Open Source Software (OSS) project. One reason for this difficulty is that issue trackers lack explanations about the knowledge or skills needed to complete a given task successfully. These explanations can be complex and time-consuming to produce. Past research has partially addressed this problem by labeling issues with issue types, issue difficulty level, and issue skills. However, current approaches are limited to a small set of labels and lack in-depth details about their semantics, which may not sufficiently help contributors identify suitable issues. To surmount this limitation, this paper explores large language models (LLMs) and Random Forest (RF) to predict the multilevel skills required to solve the open issues. We introduce a novel tool, SkillScope, which retrieves current issues from Java projects hosted on GitHub and predicts the multilevel programming skills required to resolve these issues. In a case study, we demonstrate that SkillScope could predict 217 multilevel skills for tasks with 91% precision, 88% recall, and 89% F-measure on average. Practitioners can use this tool to better delegate or choose tasks to solve in OSS projects.
中文:本文介绍了SkillScope工具,它运用大语言模型和随机森林预测开源任务所需的多层级编程技能,以高准确率帮助贡献者筛选合适的工作项。
English: This paper introduces SkillScope, a tool that uses large language models and Random Forest to predict the multilevel programming skills needed to resolve open-source issues, reaching on average 91% precision, 88% recall, and 89% F-measure in a case study and helping contributors select suitable issues.

Authors:Kun Li, Longtao Hu, Xiantao Cai, Jia Wu, Wenbin Hu
Title: Can Molecular Evolution Mechanism Enhance Molecular Representation?
Abstract:
Molecular evolution is the process of simulating the natural evolution of molecules in chemical space to explore potential molecular structures and properties. The relationships between similar molecules are often described through transformations such as adding, deleting, and modifying atoms and chemical bonds, reflecting specific evolutionary paths. Existing molecular representation methods mainly focus on mining data, such as atomic-level structures and chemical bonds directly from the molecules, often overlooking their evolutionary history. Consequently, we aim to explore the possibility of enhancing molecular representations by simulating the evolutionary process. We extract and analyze the changes in the evolutionary pathway and explore combining it with existing molecular representations. Therefore, this paper proposes the molecular evolutionary network (MEvoN) for molecular representations. First, we construct the MEvoN using molecules with a small number of atoms and generate evolutionary paths utilizing similarity calculations. Then, by modeling the atomic-level changes, MEvoN reveals their impact on molecular properties. Experimental results show that the MEvoN-based molecular property prediction method significantly improves the performance of traditional end-to-end algorithms on several molecular datasets. The code is available at https://anonymous.4open.science/r/MEvoN-7416/.
中文摘要:本文提出分子进化网络(MEvoN),通过模拟分子进化过程并与现有表征方法结合,在多个分子数据集上显著提升了分子属性预测的性能。
English Summary: This paper introduces the Molecular Evolutionary Network (MEvoN), which enhances molecular representations by simulating evolutionary processes and integrating them with existing methods, significantly improving property prediction performance on multiple datasets.

Authors:Xuejian Zhang, Ruisi He, Mi Yang, Jianwen Ding, Ruifeng Chen, Shuaiqi Gao, Ziyi Qi, Zhengyu Zhang, Bo Ai, Zhangdui Zhong
Title: Measurement-Based Non-Stationary Markov Tapped Delay Line Channel Model for 5G-Railways
Abstract:
5G for Railways (5G-R) is globally recognized as a promising next-generation railway communication system designed to meet increasing demands. Channel modeling serves as the foundation for communication system design, with tapped delay line (TDL) models widely utilized in system simulations due to their simplicity and practicality and serving as a crucial component of various standards like 3GPP. However, existing TDL models applicable to 5G-R systems are limited. Most fail to capture non-stationarity, a critical characteristic of railway communications, while others are unsuitable for the specific frequency bands and bandwidths of 5G-R. In this paper, a channel measurement campaign for a 5G-R dedicated network is carried out, resulting in a measurement-based 5-tap TDL model utilizing a first-order two-state Markov chain to represent channel non-stationarity. Key model parameters, including the number of taps, the statistical distributions of amplitude, phase and Doppler shift, and the state transition probability matrix, are extracted. The correlation between tap amplitudes is also obtained. Finally, the accuracy of the model is validated through comparisons with measurement data and the 3GPP model. These findings are expected to offer valuable insights for the design, optimization, and link-level simulation and validation of 5G-R systems.
中文: 本文提出了一种基于测量的5抽头TDL信道模型,采用马尔可夫链表征5G-R系统的非平稳特性,并通过实测数据与3GPP模型对比验证了模型准确性。
English: This paper introduces a measurement-based 5-tap TDL channel model for 5G-R systems, incorporating a first-order two-state Markov chain to capture non-stationarity and validating its accuracy against measurement data and the 3GPP model.
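
The first-order two-state Markov chain mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the transition probabilities or say what the two states denote, so the matrix `P` below and the helper names `simulate_states` and `stationary_distribution` are assumptions for illustration only.

```python
import random


def simulate_states(transition, steps, start=0, seed=0):
    """Sample a state sequence from a two-state first-order Markov chain.

    transition[i][j] is the probability of moving from state i to state j.
    """
    rng = random.Random(seed)
    state, states = start, []
    for _ in range(steps):
        states.append(state)
        # The next state depends only on the current state's row.
        state = 0 if rng.random() < transition[state][0] else 1
    return states


def stationary_distribution(transition):
    """Closed-form long-run state probabilities of a two-state chain."""
    p01 = transition[0][1]
    p10 = transition[1][0]
    return (p10 / (p01 + p10), p01 / (p01 + p10))


# Hypothetical state transition probability matrix.
P = [[0.9, 0.1],
     [0.3, 0.7]]

states = simulate_states(P, steps=10_000)
pi = stationary_distribution(P)
print(f"stationary: {pi}, empirical share of state 0: {states.count(0) / len(states):.3f}")
```

In a TDL simulation, a sequence like `states` could be used to switch a tap's parameters over time, which is one way a chain of this kind can represent non-stationarity. How the paper maps states onto taps is not stated in the abstract.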

Authors:Xuejian Zhang, Ruisi He, Mi Yang, Ziyi Qi, Zhengyu Zhang, Bo Ai, Zhangdui Zhong
Title: Vision-Aided Channel Prediction Based on Image Segmentation at Street Intersection Scenarios
Abstract:
Intelligent vehicular communication with vehicle-road collaboration capability is a key technology enabled by 6G, and the integration of various visual sensors on vehicles and infrastructures plays a crucial role. Moreover, accurate channel prediction is foundational to realizing intelligent vehicular communication. Traditional methods are still limited by the inability to balance accuracy and operability based on substantial spectrum resource consumption and highly refined description of the environment. Therefore, leveraging out-of-band information introduced by visual sensors provides a new solution and is increasingly applied across various communication tasks. In this paper, we propose a computer vision (CV)-based prediction model for vehicular communications, realizing accurate prediction of channel characteristics, including path loss, Rice K-factor and delay spread, based on image segmentation. First, we conduct extensive vehicle-to-infrastructure measurement campaigns, collecting channel and visual data from various street intersection scenarios. The image-channel dataset is generated after a series of data post-processing steps. The image data consist of individual segmentations of the target user obtained with the YOLOv8 network. Subsequently, the established dataset is used to train and test a ResNet-32 prediction network, where segmented images serve as the network input and the channel characteristics are treated as labels, i.e., target outputs. Finally, self-validation and cross-validation experiments are performed. The results indicate that models trained with segmented images achieve high prediction accuracy and remarkable generalization performance across different streets and target users. The model proposed in this paper offers novel solutions for achieving intelligent channel prediction in vehicular communications.
中文摘要:本文提出了一种基于计算机视觉的车联网预测模型,通过图像分割技术实现对路径损耗、莱斯K因子等信道特性的精准预测,验证实验表明该模型具有高精度和优异的泛化性能。
English Summary: This paper introduces a computer vision-based prediction model for intelligent vehicular communication that utilizes image segmentation to accurately forecast channel characteristics like path loss and delay spread, demonstrating high accuracy and generalization through validation experiments.

Authors:Yinan Zheng, Ruiming Liang, Kexin Zheng, Jinliang Zheng, Liyuan Mao, Jianxiong Li, Weihao Gu, Rui Ai, Shengbo Eben Li, Xianyuan Zhan, Jingjing Liu
Title: Diffusion-Based Planning for Autonomous Driving with Flexible Guidance
Abstract:
Achieving human-like driving behaviors in complex open-world environments is a critical challenge in autonomous driving. Contemporary learning-based planning approaches such as imitation learning methods often struggle to balance competing objectives and lack safety assurance, due to limited adaptability and inadequacy in learning complex multi-modal behaviors commonly exhibited in human planning, not to mention their strong reliance on the fallback strategy with predefined rules. We propose a novel transformer-based Diffusion Planner for closed-loop planning, which can effectively model multi-modal driving behavior and ensure trajectory quality without any rule-based refinement. Our model supports joint modeling of both prediction and planning tasks under the same architecture, enabling cooperative behaviors between vehicles. Moreover, by learning the gradient of the trajectory score function and employing a flexible classifier guidance mechanism, Diffusion Planner effectively achieves safe and adaptable planning behaviors. Evaluations on the large-scale real-world autonomous planning benchmark nuPlan and our newly collected 200-hour delivery-vehicle driving dataset demonstrate that Diffusion Planner achieves state-of-the-art closed-loop performance with robust transferability in diverse driving styles.
中文摘要:提出的基于Transformer的扩散规划器能有效建模多模态驾驶行为并确保轨迹安全而无需基于规则的优化,在自动驾驶基准测试中实现了最先进的性能表现。
English Summary: The proposed transformer-based Diffusion Planner effectively models multi-modal driving behaviors and ensures trajectory safety without rule-based refinement, achieving state-of-the-art performance in autonomous driving benchmarks.

Authors:Zewen Bai, Shengdi Yin, Junyu Lu, Jingjie Zeng, Haohao Zhu, Yuanyuan Sun, Liang Yang, Hongfei Lin
Title: STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection
Abstract:
The proliferation of hate speech has caused significant harm to society. The intensity and directionality of hate are closely tied to the target and argument it is associated with. However, research on hate speech detection in Chinese has lagged behind, and existing datasets lack span-level fine-grained annotations. Furthermore, the lack of research on Chinese hateful slang poses a significant challenge. In this paper, we provide a solution for fine-grained detection of Chinese hate speech. First, we construct a dataset containing Target-Argument-Hateful-Group quadruples (STATE ToxiCN), which is the first span-level Chinese hate speech dataset. Secondly, we evaluate the span-level hate speech detection performance of existing models using STATE ToxiCN. Finally, we conduct the first study on Chinese hateful slang and evaluate the ability of LLMs to detect such expressions. Our work contributes valuable resources and insights to advance span-level hate speech detection in Chinese.
中文:本研究通过构建首个中文细粒度仇恨言论数据集STATE ToxiCN,评估了现有模型的检测能力及大语言模型对中文恶意俚语的识别效果,填补了相关研究空白。
English: This study addresses the gap in Chinese hate speech detection by creating the first span-level annotated dataset, STATE ToxiCN, and evaluating model performance and LLMs' ability to identify hateful slang.

Authors:Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, Jian-Fang Hu
Title: ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Abstract:
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% $\mathcal{J}\&\mathcal{F}$ on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
中文:ReferDINO是一种新颖的参照视频目标分割模型,通过结合定位引导的掩码解码和对象一致的时间增强技术,实现了卓越性能和实时处理速度。
English: ReferDINO is a novel referring video object segmentation model that achieves superior performance and real-time speed by integrating grounding-guided mask decoding and object-consistent temporal enhancement.

Authors:Faiz Muhammad Chaudhry, Jarno Ralli, Jerome Leudet, Fahad Sohrab, Farhad Pakdaman, Pierre Corbani, Moncef Gabbouj
Title: Deep-BrownConrady: Prediction of Camera Calibration and Distortion Parameters Using Deep Learning and Synthetic Data
Abstract:
This research addresses the challenge of camera calibration and distortion parameter prediction from a single image using deep learning models. The main contributions of this work are: (1) demonstrating that a deep learning model, trained on a mix of real and synthetic images, can accurately predict camera and lens parameters from a single image, and (2) developing a comprehensive synthetic dataset using the AILiveSim simulation platform. This dataset includes variations in focal length and lens distortion parameters, providing a robust foundation for model training and testing. The training process predominantly relied on these synthetic images, complemented by a small subset of real images, to explore how well models trained on synthetic data can perform calibration tasks on real-world images. Traditional calibration methods require multiple images of a calibration object from various orientations, which is often not feasible due to the lack of such images in publicly available datasets. A deep learning network based on the ResNet architecture was trained on this synthetic dataset to predict camera calibration parameters following the Brown-Conrady lens model. The ResNet architecture, adapted for regression tasks, is capable of predicting continuous values essential for accurate camera calibration in applications such as autonomous driving, robotics, and augmented reality. Keywords: Camera calibration, distortion, synthetic data, deep learning, residual networks (ResNet), AILiveSim, horizontal field-of-view, principal point, Brown-Conrady Model.
Chinese: 本研究通过结合合成与真实图像训练深度学习模型,实现了从单张图像准确预测相机标定参数,突破了传统多图像标定方法的局限。
English: This study develops a deep learning model using synthetic and real images to accurately predict camera calibration parameters from a single image, overcoming the limitations of traditional multi-image methods.

Authors:Julien Kindle, Michael Loetscher, Andrea Alessandretti, Cesar Cadena, Marco Hutter
Title: Enhancing Robotic Precision in Construction: A Modular Factor Graph-Based Framework to Deflection and Backlash Compensation Using High-Accuracy Accelerometers
Abstract:
Accurate positioning is crucial in the construction industry, where labor shortages highlight the need for automation. Robotic systems with long kinematic chains are required to reach complex workspaces, including floors, walls, and ceilings. These requirements significantly impact positioning accuracy due to effects such as deflection and backlash in various parts along the kinematic chain. In this work, we introduce a novel approach that integrates deflection and backlash compensation models with high-accuracy accelerometers, significantly enhancing position accuracy. Our method employs a modular framework based on a factor graph formulation to estimate the state of the kinematic chain, leveraging acceleration measurements to inform the model. Extensive testing on publicly released datasets, reflecting real-world construction disturbances, demonstrates the advantages of our approach. The proposed method reduces the $95\%$ error threshold in the xy-plane by $50\%$ compared to the state-of-the-art Virtual Joint Method, and by $31\%$ when incorporating base tilt compensation.
Chinese: 本研究提出了一种创新方法,将偏转和间隙补偿模型与高精度加速度计相结合,在建筑机器人定位中显著提升了精度,使xy平面95%误差阈值相比现有技术降低了50%。
English: This study presents a novel method that integrates deflection and backlash compensation models with high-accuracy accelerometers, significantly improving robotic positioning accuracy in construction by reducing the 95% error threshold in the xy-plane by 50% compared to existing techniques.

Authors:Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang
Title: GraphRAG under Fire
Abstract:
GraphRAG advances retrieval-augmented generation (RAG) by structuring external knowledge as multi-scale knowledge graphs, enabling language models to integrate both broad context and granular details in their generation. While GraphRAG has demonstrated success across domains, its security implications remain largely unexplored. To bridge this gap, this work examines GraphRAG's vulnerability to poisoning attacks, uncovering an intriguing security paradox: existing RAG poisoning attacks are less effective under GraphRAG than conventional RAG, due to GraphRAG's graph-based indexing and retrieval; yet, the same features also create new attack surfaces. We present GragPoison, a novel attack that exploits shared relations in the underlying knowledge graph to craft poisoning text capable of compromising multiple queries simultaneously. GragPoison employs three key strategies: (i) relation injection to introduce false knowledge, (ii) relation enhancement to amplify poisoning influence, and (iii) narrative generation to embed malicious content within coherent text. Empirical evaluation across diverse datasets and models shows that GragPoison substantially outperforms existing attacks in terms of effectiveness (up to 98% success rate) and scalability (using less than 68% poisoning text) on multiple variations of GraphRAG. We also explore potential defensive measures and their limitations, identifying promising directions for future research.
中文: GraphRAG通过将知识构建为多尺度图谱来增强检索生成能力,但也暴露出新的安全漏洞,易受GragPoison等攻击利用共享关系同时破坏多个查询,具有高成功率和低投毒文本量的特点。
English: GraphRAG enhances retrieval-augmented generation by structuring knowledge into multi-scale graphs, but it introduces new vulnerabilities to poisoning attacks like GragPoison, which exploits shared relations to compromise multiple queries with high success rates and efficiency.

Authors:Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh
Title: HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
Abstract:
While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.
Chinese: HERITAGE 是一个开源的韩文汉字自然语言处理工具包,旨在通过提供标点恢复、命名实体识别、机器翻译和交互式词汇表,弥合理解韩国历史文献的鸿沟,使这些文本对普通用户和专家都更易获取。
English: HERITAGE is an open-source Hanja NLP toolkit designed to bridge the gap in understanding Korean historical documents by providing punctuation restoration, named entity recognition, machine translation, and an interactive glossary, making these texts more accessible to both general users and experts.

Authors:Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro
Title: A2SB: Audio-to-Audio Schrodinger Bridges
Abstract:
Real-world audio is often degraded by numerous factors. This work presents an audio restoration model tailored for high-res music at 44.1kHz. Our model, Audio-to-Audio Schrödinger Bridges (A2SB), is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, requiring no vocoder to predict waveform outputs; it is able to restore hour-long audio inputs and is trained on permissively licensed music data. A2SB achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
中文: 本研究提出的A2SB音频修复模型无需声码器即可实现端到端的带宽扩展和音频修复,在多个非分布音乐测试集上达到最先进的音质恢复效果。
English: This work introduces A2SB, an end-to-end audio restoration model that achieves state-of-the-art performance in bandwidth extension and inpainting for high-resolution music without requiring a vocoder.

Authors:Kanika Goswami, Puneet Mathur, Ryan Rossi, Franck Dernoncourt
Title: PlotEdit: Natural Language-Driven Accessible Chart Editing in PDFs via Multimodal LLM Agents
Abstract:
Chart visualizations, while essential for data interpretation and communication, are predominantly accessible only as images in PDFs, lacking source data tables and stylistic information. To enable effective editing of charts in PDFs or digital scans, we present PlotEdit, a novel multi-agent framework for natural language-driven end-to-end chart image editing via self-reflective LLM agents. PlotEdit orchestrates five LLM agents: (1) Chart2Table for data table extraction, (2) Chart2Vision for style attribute identification, (3) Chart2Code for retrieving rendering code, (4) Instruction Decomposition Agent for parsing user requests into executable steps, and (5) Multimodal Editing Agent for implementing nuanced chart component modifications - all coordinated through multimodal feedback to maintain visual fidelity. PlotEdit outperforms existing baselines on the ChartCraft dataset across style, layout, format, and data-centric edits, enhancing accessibility for visually challenged users and improving novice productivity.
中文摘要:PlotEdit提出了一种多智能体框架,通过自反思的大语言模型代理实现基于自然语言的PDF图表图像端到端编辑,在多种编辑任务上超越现有方法。
English Summary: PlotEdit introduces a multi-agent framework using self-reflective LLM agents to enable natural language-based end-to-end editing of chart images from PDFs, outperforming existing methods across various editing tasks.

Authors:Yiyao Yu, Yuxiang Zhang, Dongdong Zhang, Xiao Liang, Hengyuan Zhang, Xingxing Zhang, Ziyi Yang, Mahmoud Khademi, Hany Awadalla, Junjie Wang, Yujiu Yang, Furu Wei
Title: Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
Abstract:
Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet often rely on single-paradigm reasoning, limiting their effectiveness across diverse tasks. We introduce Chain-of-Reasoning (CoR), a novel unified framework integrating multiple reasoning paradigms--Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR)--to enable synergistic collaboration. CoR generates multiple potential answers via different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy for models to progressively master these paradigms, leading to CoR-Math-7B. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4o in theorem proving and a 15.0% improvement over RL-based methods on the MATH benchmark in arithmetic tasks. These results show the enhanced mathematical comprehension ability of our model, enabling zero-shot generalization across tasks.
Chinese: Chain-of-Reasoning (CoR) 框架通过整合多种推理范式提升数学问题解决能力,在定理证明和算术任务上显著超越了GPT-4o和基于强化学习的最先进模型。
English: The Chain-of-Reasoning (CoR) framework integrates multiple reasoning paradigms to enhance mathematical problem-solving, achieving significant performance improvements over state-of-the-art models like GPT-4o and RL-based methods.

Authors:Daochang Liu, Junyu Zhang, Anh-Dung Dinh, Eunbyung Park, Shichao Zhang, Ajmal Mian, Mubarak Shah, Chang Xu
Title: Generative Physical AI in Vision: A Survey
Abstract:
Generative Artificial Intelligence (AI) has rapidly advanced the field of computer vision by enabling machines to create and interpret visual data with unprecedented sophistication. This transformation builds upon a foundation of generative models to produce realistic images, videos, and 3D/4D content. Conventional generative models primarily focus on visual fidelity while often neglecting the physical plausibility of the generated content. This gap limits their effectiveness in applications that require adherence to real-world physical laws, such as robotics, autonomous systems, and scientific simulations. As generative models evolve to increasingly integrate physical realism and dynamic simulation, their potential to function as "world simulators" expands. Therefore, the field of physics-aware generation in computer vision is rapidly growing, calling for a comprehensive survey to provide a structured analysis of current efforts. To serve this purpose, the survey presents a systematic review, categorizing methods based on how they incorporate physical knowledge, either through explicit simulation or implicit learning. It also analyzes key paradigms, discusses evaluation protocols, and identifies future research directions. By offering a comprehensive overview, this survey aims to help future developments in physically grounded generation for computer vision. The reviewed papers are summarized at https://tinyurl.com/Physics-Aware-Generation.
中文: 生成式人工智能通过创造逼真视觉内容推动计算机视觉发展,但现有模型常缺乏物理合理性,因此本综述系统评述了融合物理知识的方法,以指导该领域的未来研究。
English: Generative AI is advancing computer vision by creating realistic visual content, but current models often lack physical plausibility, prompting a survey that reviews physics-aware methods to guide future developments in the field.

Authors:Junshi Xia, Hongruixuan Chen, Clifford Broni-Bediako, Yimin Wei, Jian Song, Naoto Yokoya
Title: OpenEarthMap-SAR: A Benchmark Synthetic Aperture Radar Dataset for Global High-Resolution Land Cover Mapping
Abstract:
High-resolution land cover mapping plays a crucial role in addressing a wide range of global challenges, including urban planning, environmental monitoring, disaster response, and sustainable development. However, creating accurate, large-scale land cover datasets remains a significant challenge due to the inherent complexities of geospatial data, such as diverse terrain, varying sensor modalities, and atmospheric conditions. Synthetic Aperture Radar (SAR) imagery, with its ability to penetrate clouds and capture data in all-weather, day-and-night conditions, offers unique advantages for land cover mapping. Despite these strengths, the lack of benchmark datasets tailored for SAR imagery has limited the development of robust models specifically designed for this data modality. To bridge this gap and facilitate advancements in SAR-based geospatial analysis, we introduce OpenEarthMap-SAR, a benchmark SAR dataset for global high-resolution land cover mapping. OpenEarthMap-SAR consists of 1.5 million segments of 5033 aerial and satellite images with the size of 1024$\times$1024 pixels, covering 35 regions from Japan, France, and the USA, with partially manually annotated and fully pseudo 8-class land cover labels at a ground sampling distance of 0.15--0.5 m. We evaluated the performance of state-of-the-art methods for semantic segmentation and present challenging problem settings suitable for further technical development. The dataset also serves as the official dataset for IEEE GRSS Data Fusion Contest Track I. The dataset has been made publicly available at https://zenodo.org/records/14622048.
中文: 高分辨率土地覆盖测绘对解决全球问题至关重要,但专用SAR数据集的缺乏阻碍了进展,因此推出OpenEarthMap-SAR,通过大量标注图像推动SAR分析的发展。
English: High-resolution land cover mapping is essential for tackling global issues, but the scarcity of specialized SAR datasets has hindered progress, prompting the introduction of OpenEarthMap-SAR to advance SAR-based analysis with extensive, annotated imagery.

Authors:Shuzhou Sun, Li Liu, Yongxiang Liu, Zhen Liu, Shuanghui Zhang, Janne Heikkilä, Xiang Li
Title: Uncovering Bias in Foundation Models: Impact, Testing, Harm, and Mitigation
Abstract:
Bias in Foundation Models (FMs) - trained on vast datasets spanning societal and historical knowledge - poses significant challenges for fairness and equity across fields such as healthcare, education, and finance. These biases, rooted in the overrepresentation of stereotypes and societal inequalities in training data, exacerbate real-world discrimination, reinforce harmful stereotypes, and erode trust in AI systems. To address this, we introduce Trident Probe Testing (TriProTesting), a systematic testing method that detects explicit and implicit biases using semantically designed probes. Here we show that FMs, including CLIP, ALIGN, BridgeTower, and OWLv2, demonstrate pervasive biases across single and mixed social attributes (gender, race, age, and occupation). Notably, we uncover mixed biases when social attributes are combined, such as gender x race, gender x age, and gender x occupation, revealing deeper layers of discrimination. We further propose Adaptive Logit Adjustment (AdaLogAdjustment), a post-processing technique that dynamically redistributes probability power to mitigate these biases effectively, achieving significant improvements in fairness without retraining models. These findings highlight the urgent need for ethical AI practices and interdisciplinary solutions to address biases not only at the model level but also in societal structures. Our work provides a scalable and interpretable solution that advances fairness in AI systems while offering practical insights for future research on fair AI technologies.
中文摘要:基础模型因训练数据存在普遍偏见而加剧社会歧视,但提出的三叉戟探针测试和自适应逻辑调整方法能有效检测并缓解这些偏见,推动人工智能公平性发展。
English Summary: Foundation models exhibit pervasive biases from training data that perpetuate discrimination, but the proposed Trident Probe Testing and Adaptive Logit Adjustment methods effectively detect and mitigate these biases to advance AI fairness.

Authors:Andreea Musulan, Veronica Xia, Ethan Kosak-Hine, Tom Gibbs, Vidya Sujaya, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine
Title: Online Influence Campaigns: Strategies and Vulnerabilities
Abstract:
In order to combat the creation and spread of harmful content online, this paper defines and contextualizes the concept of inauthentic, societal-scale manipulation by malicious actors. We review the literature on societally harmful content and how it proliferates to analyze the manipulation strategies used by such actors and the vulnerabilities they target. We also provide an overview of three case studies of extensive manipulation campaigns to emphasize the severity of the problem. We then address the role that Artificial Intelligence plays in the development and dissemination of harmful content, and how its evolution presents new threats to societal cohesion for countries across the globe. Our survey aims to increase our understanding of not just particular aspects of these threats, but also the strategies underlying their deployment, so we can effectively prepare for the evolving cybersecurity landscape.
中文: 本文探讨恶意行为者如何利用虚假内容进行社会操纵,通过案例研究分析其策略,并强调人工智能在加剧这些全球网络安全威胁中日益重要的作用。
English: This paper examines how malicious actors use inauthentic content to manipulate societies, analyzing their strategies through case studies and highlighting AI's growing role in escalating these global cybersecurity threats.

Authors:Ilias Diakonikolas, Nikos Zarifis
Title: A Near-optimal Algorithm for Learning Margin Halfspaces with Massart Noise
Abstract:
We study the problem of PAC learning $γ$-margin halfspaces in the presence of Massart noise. Without computational considerations, the sample complexity of this learning problem is known to be $\widetilde{Θ}(1/(γ^2 ε))$. Prior computationally efficient algorithms for the problem incur sample complexity $\tilde{O}(1/(γ^4 ε^3))$ and achieve 0-1 error of $η+ε$, where $η<1/2$ is the upper bound on the noise rate. Recent work gave evidence of an information-computation tradeoff, suggesting that a quadratic dependence on $1/ε$ is required for computationally efficient algorithms. Our main result is a computationally efficient learner with sample complexity $\widetilde{Θ}(1/(γ^2 ε^2))$, nearly matching this lower bound. In addition, our algorithm is simple and practical, relying on online SGD on a carefully selected sequence of convex losses.
中文: 本研究提出了一种计算高效的PAC学习算法,用于处理带Massart噪声的γ间隔半空间问题,通过在线随机梯度下降优化实现了$\widetilde{Θ}(1/(γ^2 ε^2))$的样本复杂度,几乎达到了理论下界。
English: This research presents a computationally efficient algorithm for PAC learning γ-margin halfspaces with Massart noise, achieving sample complexity of $\widetilde{Θ}(1/(γ^2 ε^2))$ that nearly matches the theoretical lower bound through online SGD optimization.

Authors:Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Title: ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers
Abstract:
As Vision Transformers (ViTs) are increasingly adopted in sensitive vision applications, there is a growing demand for improved interpretability. This has led to efforts to forward-align these models with carefully annotated abstract, human-understandable semantic entities - concepts. Concepts provide global rationales to the model predictions and can be quickly understood/intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose ASCENT-ViT, an attention-based, concept learning framework that effectively composes scale and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively. Further, these representations are aligned with concept annotations through attention matrices - which incorporate spatial and global (semantic) concepts. ASCENT-ViT can be utilized as a classification head on top of standard ViT backbones for improved predictive performance and accurate and robust concept explanations as demonstrated on five datasets, including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and 2 real-world datasets (AWA2, KITS).
Chinese: 为提高视觉Transformer在敏感应用中的可解释性,本文提出ASCENT-ViT这一基于注意力的概念学习框架,通过融合多尺度特征与补丁表示来对齐概念标注,在多个数据集上实现了预测性能和概念解释能力的同步提升。
English: To enhance the interpretability of Vision Transformers in sensitive applications, this paper introduces ASCENT-ViT, an attention-based concept learning framework that integrates multiscale features and patch representations with concept annotations, improving both predictive performance and concept explanations across multiple datasets.

Authors:Weixi Feng, Chao Liu, Sifei Liu, William Yang Wang, Arash Vahdat, Weili Nie
Title: BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
Abstract:
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects, raising the need for additional grounding input for improved controllability. In this work, we propose to decompose videos into visual primitives - blob video representation, a general representation for controllable video generation. Based on blob conditions, we develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance. In particular, we introduce a masked 3D attention module that effectively improves regional consistency across frames. In addition, we introduce a learnable module to interpolate text embeddings so that users can control semantics in specific frames and obtain smooth object transitions. We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models. Extensive experimental results show that BlobGEN-Vid achieves superior zero-shot video generation ability and state-of-the-art layout controllability on multiple benchmarks. When combined with an LLM for layout planning, our framework even outperforms proprietary text-to-video generators in terms of compositional accuracy.
Chinese: 现有视频生成模型难以遵循复杂文本提示并合成多个对象,因此我们开发了基于视觉基元(blob)的BlobGEN-Vid视频扩散模型,通过掩码3D注意力模块和可学习插值技术显著提升区域一致性和语义控制,在多项基准测试中实现了领先的零样本生成与布局控制能力。
English: Current video generation models face challenges in adhering to complex text prompts and synthesizing multiple objects, prompting the development of BlobGEN-Vid, a blob-grounded video diffusion model that enhances controllability through visual primitives and achieves superior zero-shot generation and layout control.

Authors:Runpu Wei, Zijin Yin, Shuo Zhang, Lanxiang Zhou, Xueyi Wang, Chao Ban, Tianwei Cao, Hao Sun, Zhongjiang He, Kongming Liang, Zhanyu Ma
Title: OmniEraser: Remove Objects and Their Effects in Images with Paired Video-Frame Data
Abstract:
Inpainting algorithms have achieved remarkable progress in removing objects from images, yet they still face two challenges: 1) they struggle to handle the object's visual effects such as shadow and reflection; 2) they easily generate shape-like artifacts and unintended content. In this paper, we propose Video4Removal, a large-scale dataset comprising over 100,000 high-quality samples with realistic object shadows and reflections. By constructing object-background pairs from video frames with off-the-shelf vision models, the labor costs of data acquisition can be significantly reduced. To avoid generating shape-like artifacts and unintended content, we propose Object-Background Guidance, an elaborated paradigm that takes both the foreground object and background images. It can guide the diffusion process to harness richer contextual information. Based on the above two designs, we present OmniEraser, a novel method that seamlessly removes objects and their visual effects using only object masks as input. Extensive experiments show that OmniEraser significantly outperforms previous methods, particularly in complex in-the-wild scenes. It also exhibits a strong generalization ability in anime-style images. Datasets, models, and codes will be published.
中文: 本文提出了OmniEraser方法,通过新数据集和引导范式有效去除图像中的物体及其视觉效应(如阴影和反射),在复杂场景中显著优于现有方法,并展现出强大的泛化能力。
English: The paper introduces OmniEraser, a method that effectively removes objects and their visual effects like shadows and reflections from images using a novel dataset and guidance paradigm, outperforming previous approaches in complex scenes and demonstrating strong generalization.

Authors:Qian Gong, Zhe Wang, Viktor Reshniak, Xin Liang, Jieyang Chen, Qing Liu, Tushar M. Athawale, Yi Ju, Anand Rangarajan, Sanjay Ranka, Norbert Podhorszki, Rick Archibald, Scott Klasky
Title: A General Framework for Error-controlled Unstructured Scientific Data Compression
Abstract:
Data compression plays a key role in reducing storage and I/O costs. Traditional lossy methods primarily target data on rectilinear grids and cannot leverage the spatial coherence in unstructured mesh data, leading to suboptimal compression ratios. We present a multi-component, error-bounded compression framework designed to enhance the compression of floating-point unstructured mesh data, which is common in scientific applications. Our approach involves interpolating mesh data onto a rectilinear grid and then separately compressing the grid interpolation and the interpolation residuals. This method is general, independent of mesh types and typologies, and can be seamlessly integrated with existing lossy compressors for improved performance. We evaluated our framework across twelve variables from two synthetic datasets and two real-world simulation datasets. The results indicate that the multi-component framework consistently outperforms state-of-the-art lossy compressors on unstructured data, achieving, on average, a $2.3-3.5\times$ improvement in compression ratios, with error bounds ranging from $10^{-6}$ to $10^{-2}$. We further investigate the impact of hyperparameters, such as grid spacing and error allocation, to deliver optimal compression ratios in diverse datasets.
Chinese: 本文提出一种多组件、误差有界的压缩框架,通过将非结构化网格数据插值到规则网格并分别压缩网格和残差,显著提升了浮点数据的压缩效率,相比现有最优方法压缩比提高了2.3-3.5倍。
English: This paper introduces a multi-component, error-bounded compression framework that enhances compression of floating-point unstructured mesh data by interpolating it onto a rectilinear grid and separately compressing the grid and residuals, achieving 2.3-3.5× better compression ratios than state-of-the-art methods.
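The two-component idea in this abstract (compress a grid interpolation, then compress the residuals) can be illustrated with a small 1D sketch. This is not the authors' implementation: the function names, the nearest-sample resampling, the uniform scalar quantizer, and the omission of any entropy coder are all assumptions made for illustration, and the mesh coordinates `xs` are assumed to be stored separately and available to the decoder.

```python
import math

def quantize(values, eb):
    """Uniform scalar quantization; each value is off by at most eb after decoding."""
    return [round(v / (2 * eb)) for v in values]

def dequantize(codes, eb):
    return [c * 2 * eb for c in codes]

def interp(grid, x0, h, x):
    """Linear interpolation of uniform-grid values at position x."""
    t = (x - x0) / h
    i = min(max(int(math.floor(t)), 0), len(grid) - 2)
    w = t - i
    return (1 - w) * grid[i] + w * grid[i + 1]

def compress(xs, vals, n_grid, eb_grid, eb_res):
    """Hypothetical two-component encoder for scattered 1D samples."""
    x0, x1 = min(xs), max(xs)
    h = (x1 - x0) / (n_grid - 1)
    # Component 1: resample onto a uniform grid (nearest sample per node,
    # an assumption made to keep the sketch short).
    grid = []
    for k in range(n_grid):
        gx = x0 + k * h
        j = min(range(len(xs)), key=lambda idx: abs(xs[idx] - gx))
        grid.append(vals[j])
    grid_codes = quantize(grid, eb_grid)
    # Component 2: residuals against the decoded grid, so that the decoder
    # can rebuild exactly the same prediction.
    grid_hat = dequantize(grid_codes, eb_grid)
    res = [v - interp(grid_hat, x0, h, x) for x, v in zip(xs, vals)]
    res_codes = quantize(res, eb_res)
    return (x0, h, grid_codes, res_codes)

def decompress(xs, packed, eb_grid, eb_res):
    x0, h, grid_codes, res_codes = packed
    grid_hat = dequantize(grid_codes, eb_grid)
    res_hat = dequantize(res_codes, eb_res)
    return [interp(grid_hat, x0, h, x) + r for x, r in zip(xs, res_hat)]
```

In this sketch the pointwise reconstruction error is governed by `eb_res` alone, because the residuals are taken against the already-decoded grid; `eb_grid` only affects how large the residuals are, which loosely mirrors the error-allocation hyperparameter mentioned in the abstract.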

Authors:Kun Yang, Jing Yang, Cong Shen
Title: Average Reward Reinforcement Learning for Wireless Radio Resource Management
Abstract:
In this paper, we address a crucial but often overlooked issue in applying reinforcement learning (RL) to radio resource management (RRM) in wireless communications: the mismatch between the discounted reward RL formulation and the undiscounted goal of wireless network optimization. To the best of our knowledge, we are the first to systematically investigate this discrepancy, starting with a discussion of the problem formulation followed by simulations that quantify the extent of the gap. To bridge this gap, we introduce the use of average reward RL, a method that aligns more closely with the long-term objectives of RRM. We propose a new method, the Average Reward Off policy Soft Actor Critic (ARO SAC), an adaptation of the well known Soft Actor Critic algorithm to the average reward framework. This new method achieves a significant performance improvement: our simulation results demonstrate a 15% gain in system performance over the traditional discounted reward RL approach, underscoring the potential of average reward RL in enhancing the efficiency and effectiveness of wireless network optimization.
中文摘要:本文针对强化学习在无线通信资源管理中奖励折扣目标与网络优化长期目标不匹配的问题,首次提出平均奖励框架下的新型ARO SAC算法,实验证明其性能比传统方法提升15%。
English Summary: This paper identifies the mismatch between discounted reward reinforcement learning and the undiscounted goals of wireless network optimization, proposing a new Average Reward Off-policy Soft Actor-Critic (ARO SAC) method that achieves 15% performance improvement over traditional approaches.
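For readers unfamiliar with the distinction, the two objectives contrasted in this abstract are usually written as follows. These are the standard textbook definitions, not formulas taken from the paper; the symbols $\pi$ (policy), $r_t$ (reward at step $t$), and $\gamma$ (discount factor) are notation assumed here.

```latex
% Discounted objective: rewards far in the future are down-weighted by \gamma^t.
J_\gamma(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], \qquad 0 \le \gamma < 1

% Average reward objective: every time step carries equal weight.
\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_t\right]
```

A long-run network metric such as mean throughput corresponds to the second form, which is why optimizing the first can leave a gap.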

Authors:Zhonghao Yan, Zijin Yin, Tianyu Lin, Xiangzhu Zeng, Kongming Liang, Zhanyu Ma
Title: PGP-SAM: Prototype-Guided Prompt Learning for Efficient Few-Shot Medical Image Segmentation
Abstract:
The Segment Anything Model (SAM) has demonstrated strong and versatile segmentation capabilities, along with intuitive prompt-based interactions. However, customizing SAM for medical image segmentation requires massive amounts of pixel-level annotations and precise point- or box-based prompt designs. To address these challenges, we introduce PGP-SAM, a novel prototype-based few-shot tuning approach that uses limited samples to replace tedious manual prompts. Our key idea is to leverage inter- and intra-class prototypes to capture class-specific knowledge and relationships. We propose two main components: (1) a plug-and-play contextual modulation module that integrates multi-scale information, and (2) a class-guided cross-attention mechanism that fuses prototypes and features for automatic prompt generation. Experiments on a public multi-organ dataset and a private ventricle dataset demonstrate that PGP-SAM achieves superior mean Dice scores compared with existing prompt-free SAM variants, while using only 10% of the 2D slices.
Chinese: PGP-SAM提出了一种基于原型的少样本调优方法,利用类间和类内原型自动生成提示,在仅使用10%二维切片的情况下,相比现有SAM变体在医学图像分割上实现了更优的性能。
English: PGP-SAM introduces a prototype-based few-shot tuning method that utilizes inter- and intra-class prototypes to automate prompt generation, achieving superior segmentation performance on medical images with only 10% of 2D slices compared to existing SAM variants.

Authors:Wenshu Fan, Minxing Zhang, Hongwei Li, Wenbo Jiang, Hanxiao Chen, Xiangyu Yue, Michael Backes, Xiao Zhang
Title: DivTrackee versus DynTracker: Promoting Diversity in Anti-Facial Recognition against Dynamic FR Strategy
Abstract:
The widespread adoption of facial recognition (FR) models raises serious concerns about their potential misuse, motivating the development of anti-facial recognition (AFR) to protect user facial privacy. In this paper, we argue that the static FR strategy, predominantly adopted in prior literature for evaluating AFR efficacy, cannot faithfully characterize the actual capabilities of determined trackers who aim to track a specific target identity. In particular, we introduce DynTracker, a dynamic FR strategy where the model's gallery database is iteratively updated with newly recognized target identity images. Surprisingly, such a simple approach renders all the existing AFR protections ineffective. To mitigate the privacy threats posed by DynTracker, we advocate for explicitly promoting diversity in the AFR-protected images. We hypothesize that the lack of diversity is the primary cause of the failure of existing AFR methods. Specifically, we develop DivTrackee, a novel method for crafting diverse AFR protections that builds upon a text-guided image generation framework and diversity-promoting adversarial losses. Through comprehensive experiments on various image benchmarks and feature extractors, we demonstrate DynTracker's strength in breaking existing AFR methods and the superiority of DivTrackee in preventing user facial images from being identified by dynamic FR strategies. We believe our work can act as an important initial step towards developing more effective AFR methods for protecting user facial privacy against determined trackers.
中文摘要:本文提出DynTracker动态人脸识别策略,通过迭代更新数据库使现有反人脸识别方法失效,并开发DivTrackee新型保护方法,通过增强保护图像的多样性来应对这种威胁。
English Summary: This paper introduces DynTracker, a dynamic facial recognition strategy that defeats existing anti-facial recognition methods by iteratively updating its database, and proposes DivTrackee, a novel protection method that enhances diversity in protected images to counter this threat.

Authors:Tsui Qin Mok, Shuyong Gao, Haozhe Xing, Miaoyang He, Yan Wang, Wenqiang Zhang
Title: A Holistically Point-guided Text Framework for Weakly-Supervised Camouflaged Object Detection
Abstract:
Weakly-Supervised Camouflaged Object Detection (WSCOD) has gained popularity for its promise to train models with weak labels to segment objects that visually blend into their surroundings. Recently, some methods using sparsely-annotated supervision have shown promising results through scribbling in WSCOD, while point-text supervision remains underexplored. Hence, this paper introduces a novel holistically point-guided text framework for WSCOD by decomposing the task into three phases: segment, choose, train. Specifically, we propose Point-guided Candidate Generation (PCG), where the point's foreground serves as a correction for the text path to explicitly correct and rejuvenate the loss detection object during the mask generation process (SEGMENT). We also introduce a Qualified Candidate Discriminator (QCD) to choose the optimal mask from a given text prompt using CLIP (CHOOSE), and employ the chosen pseudo mask for training with a self-supervised Vision Transformer (TRAIN). Additionally, we developed a new point-supervised dataset (P2C-COD) and a text-supervised dataset (T-COD). Comprehensive experiments on four benchmark datasets demonstrate our method outperforms state-of-the-art methods by a large margin, and also outperforms some existing fully-supervised camouflaged object detection methods.
中文: 本文提出了一种新颖的全方位点引导文本框架用于弱监督伪装目标检测,通过分解为分割、选择和训练三个阶段,显著超越了现有方法的性能。
English: This paper introduces a novel holistically point-guided text framework for Weakly-Supervised Camouflaged Object Detection, which decomposes the process into segment, choose, and train phases to significantly outperform existing methods.

Authors:Jiageng Li, Zhen Dong, Chong Wang, Haozhen You, Cen Zhang, Yang Liu, Xin Peng
Title: LLM Based Input Space Partitioning Testing for Library APIs
Abstract:
Automated testing of library APIs is difficult as it requires exploring a vast space of parameter inputs that may involve objects with complex data types. Existing search based approaches, with limited knowledge of relations between object states and program branches, often suffer from the low efficiency issue, i.e., tending to generate invalid inputs. Symbolic execution based approaches can effectively identify such relations, but fail to scale to large programs. In this work, we present an LLM-based input space partitioning testing approach, LISP, for library APIs. The approach leverages LLMs to understand the code of a library API under test and perform input space partitioning based on its understanding and rich common knowledge. Specifically, we provide the signature and code of the API under test to LLMs, with the expectation of obtaining a text description of each input space partition of the API under test. Then, we generate inputs through employing the generated text description to sample inputs from each partition, ultimately resulting in test suites that systematically explore the program behavior of the API. We evaluate LISP on more than 2,205 library API methods taken from 10 popular open-source Java libraries (e.g., apache/commonslang with 2.6k stars, guava with 48.8k stars on GitHub). Our experiment results show that LISP is effective in library API testing. It significantly outperforms the state-of-the-art tool EvoSuite in terms of edge coverage. On average, LISP achieves 67.82% branch coverage, surpassing EvoSuite by 1.21 times. In total, LISP triggers 404 exceptions or errors in the experiments, and discovers 13 previously unknown vulnerabilities during evaluation, which have been assigned CVE IDs.
中文: 本文提出LISP方法,利用大语言模型理解库API代码并进行输入空间划分,在自动化测试中显著优于现有工具,实现了更高的分支覆盖率并发现了多个未知漏洞。
English: This paper introduces LISP, an LLM-based input space partitioning approach for automated library API testing that leverages large language models to understand code and partition input spaces, achieving superior branch coverage and discovering multiple vulnerabilities compared to existing tools.

Authors:Valentin De Bortoli, Alexandre Galashov, Arthur Gretton, Arnaud Doucet
Title: Accelerated Diffusion Models via Speculative Sampling
Abstract:
Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and accepting or rejecting them based on the target model's distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out of the box to any diffusion model. Our experiments demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations, while generating exact samples from the target model.
中文: 推测采样技术扩展至扩散模型,通过无需额外训练的简单草案策略,显著加速生成精确样本。
English: Speculative sampling is extended to diffusion models, enabling significant speedup in generating exact samples by using a simple draft strategy without additional training.
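
The accept/reject step that the abstract describes for language models can be sketched as follows. This is a minimal illustration of the standard discrete-token case only, not the paper's extension to diffusion models; the function name `speculative_step` and the toy distributions are assumptions made for the example.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step of discrete speculative sampling.

    p: target distribution over tokens (list of probabilities).
    q: draft distribution over the same tokens.
    Returns (token, accepted). The returned token is distributed according to p.
    """
    tokens = range(len(p))
    # Draw a candidate token from the cheap draft distribution.
    x = rng.choices(tokens, weights=q)[0]
    # Accept the candidate with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, 0.3, 0.2]  # hypothetical target distribution
    draft = [0.2, 0.2, 0.6]   # hypothetical draft distribution
    counts = [0, 0, 0]
    for _ in range(20000):
        token, _ = speculative_step(target, draft, rng)
        counts[token] += 1
    print([c / 20000 for c in counts])  # empirical frequencies, close to target
```

The residual resampling is what makes the output exact: accepted and resampled tokens together follow the target distribution regardless of how poor the draft is, and a better draft only raises the acceptance rate.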

Authors:Jiaxing Li, Wei Liu, Chao Xue, Yibing Zhan, Xiaoxing Wang, Weifeng Liu, Dacheng Tao
Title: Modeling All Response Surfaces in One for Conditional Search Spaces
Abstract:
Bayesian Optimization (BO) is a sample-efficient black-box optimizer commonly used in search spaces where hyperparameters are independent. However, in many practical AutoML scenarios, there will be dependencies among hyperparameters, forming a conditional search space, which can be partitioned into structurally distinct subspaces. The structure and dimensionality of hyperparameter configurations vary across these subspaces, challenging the application of BO. Some previous BO works have proposed solutions to develop multiple Gaussian Process models in these subspaces. However, these approaches tend to be inefficient as they require a substantial number of observations to guarantee each GP's performance and cannot capture relationships between hyperparameters across different subspaces. To address these issues, this paper proposes a novel approach to model the response surfaces of all subspaces in one, which can model the relationships between hyperparameters elegantly via a self-attention mechanism. Concretely, we design a structure-aware hyperparameter embedding to preserve the structural information. Then, we introduce an attention-based deep feature extractor, capable of projecting configurations with different structures from various subspaces into a unified feature space, where the response surfaces can be formulated using a single standard Gaussian Process. The empirical results on a simulation function, various real-world tasks, and HPO-B benchmark demonstrate that our proposed approach improves the efficacy and efficiency of BO within conditional search spaces.
中文摘要:本文提出了一种新颖的贝叶斯优化方法,通过自注意力机制和结构感知嵌入将条件搜索空间建模到统一特征空间中,在多个基准测试中显著提升了优化效率与效果。
English Summary: This paper introduces a novel Bayesian Optimization method that uses a self-attention mechanism and structure-aware embeddings to model conditional search spaces within a unified feature space, improving both efficiency and effectiveness across various benchmarks.

Authors:Zhengnan Sun, Zhaotai Shi, Jiayin Chen, Qingtao Liu, Yu Cui, Qi Ye, Jiming Chen
Title: VTAO-BiManip: Masked Visual-Tactile-Action Pre-training with Object Understanding for Bimanual Dexterous Manipulation
Abstract:
Bimanual dexterous manipulation remains a significant challenge in robotics due to the high DoFs of each hand and their coordination. Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills. In this paper, we introduce VTAO-BiManip, a novel framework that combines visual-tactile-action pretraining with object understanding to facilitate curriculum RL to enable human-like bimanual manipulation. We improve prior learning by incorporating hand motion data, providing more effective guidance for dual-hand coordination than binary tactile feedback. Our pretraining model predicts future actions as well as object pose and size using masked multimodal inputs, facilitating cross-modal regularization. To address the multi-skill learning challenge, we introduce a two-stage curriculum RL approach to stabilize training. We evaluate our method on a bottle-cap unscrewing task, demonstrating its effectiveness in both simulated and real-world environments. Our approach achieves a success rate that surpasses existing visual-tactile pretraining methods by over 20%.
中文: 本文提出的VTAO-BiManip框架通过视觉-触觉-动作预训练与物体理解相结合,采用课程强化学习方法实现拟人化双手操作,在瓶盖拧开任务中比现有方法成功率提升超过20%。
English: This paper presents VTAO-BiManip, a novel framework integrating visual-tactile-action pretraining with object understanding to enable human-like bimanual manipulation through curriculum reinforcement learning, achieving over 20% higher success rates than existing methods in bottle-cap unscrewing tasks.

Authors:Wenxuan Li, Pedro R. A. S. Bassi, Tianyu Lin, Yu-Cheng Chou, Xinze Zhou, Yucheng Tang, Fabian Isensee, Kang Wang, Qi Chen, Xiaowei Xu, Xiaoxi Chen, Lizhou Wu, Qilong Wu, Yannick Kirchhoff, Maximilian Rokuss, Saikat Roy, Yuxuan Zhao, Dexin Yu, Kai Ding, Constantin Ulrich, Klaus Maier-Hein, Yang Yang, Alan L. Yuille, Zongwei Zhou
Title: ScaleMAI: Accelerating the Development of Trusted Datasets and AI Models
Abstract:
Building trusted datasets is critical for transparent and responsible Medical AI (MAI) research, but creating even small, high-quality datasets can take years of effort from multidisciplinary teams. This process often delays AI benefits, as human-centric data creation and AI-centric model development are treated as separate, sequential steps. To overcome this, we propose ScaleMAI, an agent of AI-integrated data curation and annotation, allowing data quality and AI performance to improve in a self-reinforcing cycle and reducing development time from years to months. We adopt pancreatic tumor detection as an example. First, ScaleMAI progressively creates a dataset of 25,362 CT scans, including per-voxel annotations for benign/malignant tumors and 24 anatomical structures. Second, through progressive human-in-the-loop iterations, ScaleMAI provides Flagship AI Model that can approach the proficiency of expert annotators (30-year experience) in detecting pancreatic tumors. Flagship Model significantly outperforms models developed from smaller, fixed-quality datasets, with substantial gains in tumor detection (+14%), segmentation (+5%), and classification (72%) on three prestigious benchmarks. In summary, ScaleMAI transforms the speed, scale, and reliability of medical dataset creation, paving the way for a variety of impactful, data-driven applications.
中文: ScaleMAI提出了一种人工智能集成的数据管理方法,通过自我强化的循环机制将医学数据集创建时间从数年缩短至数月,并在胰腺肿瘤检测中实现接近专家水平的性能,各项基准指标显著提升。
English: ScaleMAI introduces an AI-integrated data curation system that accelerates medical dataset creation from years to months while improving AI performance through a self-reinforcing cycle, demonstrated by achieving expert-level pancreatic tumor detection with significant benchmark improvements.

Authors:Jeffrey Kelling, Vicente Bolea, Michael Bussmann, Ankush Checkervarty, Alexander Debus, Jan Ebert, Greg Eisenhauer, Vineeth Gutta, Stefan Kesselheim, Scott Klasky, Vedhas Pandit, Richard Pausch, Norbert Podhorszki, Franz Poschel, David Rogers, Jeyhun Rustamov, Steve Schmerler, Ulrich Schramm, Klaus Steiniger, Rene Widera, Anna Willmann, Sunita Chandrasekaran
Title: The Artificial Scientist -- in-transit Machine Learning of Plasma Simulations
Abstract:
Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive IO and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application output routines. As a proof-of-concept we consider a GPU accelerated particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting in learning from this non-steady process in a continual manner. We detail challenges addressed while porting and scaling to the Frontier exascale system.
中文: 该摘要介绍了一种流式工作流程,通过将模拟数据直接传输至机器学习框架来规避文件系统瓶颈,实现异步数据转换和训练,同时利用常用编程语言简化用户操作。
English: This abstract presents a streaming workflow that bypasses file system bottlenecks by directly channeling simulation data to a machine-learning framework, enabling asynchronous data transformation and training while simplifying user operations with common programming languages.

Authors:Pegah Khayatan, Mustafa Shukor, Jayneel Parekh, Arnaud Dapogny, Matthieu Cord
Title: Analyzing Finetuning Representation Shift for Multimodal LLMs Steering
Abstract:
Multimodal LLMs (MLLMs) have reached remarkable levels of proficiency in understanding multimodal inputs. However, understanding and interpreting the behavior of such complex models is a challenging task, not to mention the dynamic shifts that may occur during fine-tuning, or due to covariate shift between datasets. In this work, we apply concept-level analysis towards MLLM understanding. More specifically, we propose to map hidden states to interpretable visual and textual concepts. This enables us to more efficiently compare certain semantic dynamics, such as the shift from an original to a fine-tuned model, revealing concept alteration and potential biases that may occur during fine-tuning. We also demonstrate the use of shift vectors to capture these concept changes. These shift vectors allow us to recover fine-tuned concepts by applying simple, computationally inexpensive additive concept shifts in the original model. Finally, our findings also have direct applications for MLLM steering, which can be used for model debiasing as well as enforcing safety in MLLM output. All in all, we propose a novel, training-free, ready-to-use framework for MLLM behavior interpretability and control. Our implementation is publicly available.
中文: 本研究提出无需训练的框架,将多模态大语言模型的隐状态映射至可解释概念,可分析微调引发的语义变化,并通过概念级调整实现模型控制。
English: This study introduces a training-free framework that maps multimodal LLM hidden states to interpretable concepts, enabling analysis of fine-tuning-induced shifts and model control through concept-level adjustments.

Authors:Kairui Fu, Zheqi Lv, Shengyu Zhang, Fan Wu, Kun Kuang
Title: Forward Once for All: Structural Parameterized Adaptation for Efficient Cloud-coordinated On-device Recommendation
Abstract:
In cloud-centric recommender systems, regular data exchanges between user devices and the cloud could potentially elevate bandwidth demands and privacy risks. On-device recommendation emerges as a viable solution by performing reranking locally to alleviate these concerns. Existing methods primarily focus on developing local adaptive parameters, while potentially neglecting the critical role of tailor-made model architecture. Insights from broader research domains suggest that varying data distributions might favor distinct architectures for better fitting. In addition, imposing a uniform model structure across heterogeneous devices may risk inefficacy on less capable devices or sub-optimal performance on those with sufficient capabilities. In response to these gaps, our paper introduces Forward-OFA, a novel approach for the dynamic construction of device-specific networks (both structure and parameters). Forward-OFA employs a structure controller to selectively determine whether each block needs to be assembled for a given device. However, during the training of the structure controller, these assembled heterogeneous structures are jointly optimized, where the co-adaptation among blocks might encounter gradient conflicts. To mitigate this, Forward-OFA is designed to establish a structure-guided mapping of real-time behaviors to the parameters of assembled networks. Structure-related parameters and parallel components within the mapper prevent each part from receiving heterogeneous gradients from others, thus bypassing the gradient conflicts for coupled optimization. Besides, direct mapping enables Forward-OFA to achieve adaptation through only one forward pass, allowing for swift adaptation to changing interests and eliminating the requirement for on-device backpropagation. Experiments on real-world datasets demonstrate the effectiveness and efficiency of Forward-OFA.
中文:Forward-OFA提出了一种动态构建设备特定网络的新方法,通过结构控制器避免梯度冲突,并利用单次前向传播实现快速适应,有效提升了设备端推荐系统的效率和性能。
English: Forward-OFA introduces a dynamic, device-specific network construction method that uses a structure controller to avoid gradient conflicts and enables rapid adaptation with a single forward pass, enhancing efficiency and performance in on-device recommender systems.

Authors:Yan Hu, Mingdao Gong, Zhongxi Qiu, Jiabao Liu, Hongli Shen, Mingzhen Yuan, Xiaoqing Zhang, Heng Li, Hai Lu, Jiang Liu
Title: COph100: A comprehensive fundus image registration dataset from infants constituting the "RIDIRP" database
Abstract:
Retinal image registration is vital for diagnostic and therapeutic applications within the field of ophthalmology. Existing public datasets, focusing on adult retinal pathologies with high-quality images, have a limited number of image pairs and neglect clinical challenges. To address this gap, we introduce COph100, a novel and challenging dataset known as the Comprehensive Ophthalmology Retinal Image Registration dataset for infants with a wide range of image quality issues constituting the public "RIDIRP" database. COph100 consists of 100 eyes, each with 2 to 9 examination sessions, amounting to a total of 491 image pairs carefully selected from the publicly available dataset. We manually labeled the corresponding ground truth image points and provided automatic vessel segmentation masks for each image. We have assessed COph100 in terms of image quality and registration outcomes using state-of-the-art algorithms. This resource enables a robust comparison of retinal registration methodologies and aids in the analysis of disease progression in infants, thereby deepening our understanding of pediatric ophthalmic conditions.
中文摘要:COph100数据集通过提供包含多种图像质量问题的婴儿视网膜图像综合数据集,弥补了现有数据集的不足,为视网膜配准方法的比较和儿科疾病进展分析提供了有力支持。
English Summary: The COph100 dataset addresses the limitations of existing retinal image datasets by providing a comprehensive collection of infant retinal images with varying quality issues, enabling robust registration methodology comparisons and pediatric disease progression analysis.

Authors:Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
Title: GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
Abstract:
4D video control is essential in video generation as it enables the use of sophisticated lens techniques, such as multi-camera shooting and dolly zoom, which are currently unsupported by existing methods. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that optimizes a 4D representation and renders videos according to different 4D elements, such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. Then we finetune a pretrained DiT to generate videos following the guidance of the rendered video, dubbed as GS-DiT. To boost the training of the GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and accelerates the inference speed by two orders of magnitude. During the inference stage, GS-DiT can generate videos with the same dynamic content while adhering to different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization capabilities and extends the 4D controllability of Gaussian splatting to video generation beyond just camera poses. It supports advanced cinematic effects through the manipulation of the Gaussian field and camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.
中文: GS-DiT框架通过引入伪4D高斯场实现了先进的四维视频控制,能够生成具有多样化相机参数和电影特效的动态内容,有效克服了现有视频生成方法的局限性。
English: The GS-DiT framework introduces pseudo 4D Gaussian fields to enable advanced 4D video control, allowing dynamic content generation with diverse camera parameters and cinematic effects, overcoming the limitations of existing video generation methods.

Authors:Yaxian Wang, Henghui Ding, Shuting He, Xudong Jiang, Bifan Wei, Jun Liu
Title: Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension
Abstract:
In this work, we address the challenging task of Generalized Referring Expression Comprehension (GREC). Compared to the classic Referring Expression Comprehension (REC) that focuses on single-target expressions, GREC extends the scope to a more practical setting by further encompassing no-target and multi-target expressions. Existing REC methods face challenges in handling the complex cases encountered in GREC, primarily due to their fixed output and limitations in multi-modal representations. To address these issues, we propose a Hierarchical Alignment-enhanced Adaptive Grounding Network (HieA2G) for GREC, which can flexibly deal with various types of referring expressions. First, a Hierarchical Multi-modal Semantic Alignment (HMSA) module is proposed to incorporate three levels of alignments, including word-object, phrase-object, and text-image alignment. It enables hierarchical cross-modal interactions across multiple levels to achieve comprehensive and robust multi-modal understanding, greatly enhancing grounding ability for complex cases. Then, to address the varying number of target objects in GREC, we introduce an Adaptive Grounding Counter (AGC) to dynamically determine the number of output targets. Additionally, an auxiliary contrastive loss is employed in AGC to enhance object-counting ability by pulling in multi-modal features with the same counting and pushing away those with different counting. Extensive experimental results show that HieA2G achieves new state-of-the-art performance on the challenging GREC task and also the other 4 tasks, including REC, Phrase Grounding, Referring Expression Segmentation (RES), and Generalized Referring Expression Segmentation (GRES), demonstrating the remarkable superiority and generalizability of the proposed HieA2G.
中文摘要:本文提出HieA2G网络,通过分层多模态对齐和自适应定位机制,能灵活处理广义指代表达理解任务,在多个视觉语言任务中实现了最优性能。
English Summary: This paper introduces HieA2G, a novel network that uses hierarchical multi-modal alignment and adaptive grounding to flexibly handle generalized referring expression comprehension, achieving state-of-the-art performance across multiple vision-language tasks.

Authors:Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guangyi Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li
Title: A3: Android Agent Arena for Mobile GUI Agents
Abstract:
AI agents have become increasingly prevalent in recent years, driven by significant advancements in the field of large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated business-level LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations and a new autonomous evaluation process that requires less human labor and coding expertise. The project is available at https://yuxiangchai.github.io/Android-Agent-Arena/.
中文摘要:Android Agent Arena (A3) 作为一种新型评估平台,通过提供实际任务、灵活操作空间和基于大语言模型的自动化评估,解决了现有移动GUI代理数据集的局限性,适用于真实场景的性能测试。
English Summary: Android Agent Arena (A3) is introduced as a novel evaluation platform addressing the limitations of existing mobile GUI agent datasets by offering practical tasks, a flexible action space, and automated LLM-based evaluation for real-world scenarios.

Authors:Tao Feng, Wei Li, Didi Zhu, Hangjie Yuan, Wendi Zheng, Dan Zhang, Jie Tang
Title: ZeroFlow: Overcoming Catastrophic Forgetting is Easier than You Think
Abstract:
Backpropagation provides a generalized configuration for overcoming catastrophic forgetting. Optimizers such as SGD and Adam are commonly used for weight updates in continual learning and continual pre-training. However, access to gradient information is not always feasible in practice due to black-box APIs, hardware constraints, or non-differentiable systems, a challenge we refer to as the gradient bans. To bridge this gap, we introduce ZeroFlow, the first benchmark designed to evaluate gradient-free optimization algorithms for overcoming forgetting. ZeroFlow examines a suite of forward pass-based methods across various algorithms, forgetting scenarios, and datasets. Our results show that forward passes alone can be sufficient to mitigate forgetting. We uncover novel optimization principles that highlight the potential of forward pass-based methods in mitigating forgetting, managing task conflicts, and reducing memory demands. Additionally, we propose new enhancements that further improve forgetting resistance using only forward passes. This work provides essential tools and insights to advance the development of forward-pass-based methods for continual learning.
中文摘要:ZeroFlow是首个评估持续学习中无梯度优化方法的基准,证明仅通过前向传播即可有效缓解遗忘问题,并提出了新的改进措施以提升性能。
English Summary: ZeroFlow is the first benchmark for evaluating gradient-free optimization methods in continual learning, demonstrating that forward passes alone can effectively mitigate forgetting and offering new enhancements for improved performance.

Authors:Dongnan Xia, Cunhua Pan, Hong Ren, Zhiyuan Yu, Yasheng Jin, Jiangzhou Wang
Title: RIS-Aided Integrated Sensing and Communication Systems under Dual-polarized Channels
Abstract:
This paper considers reconfigurable intelligent surface (RIS)-aided integrated sensing and communication (ISAC) systems under dual-polarized (DP) channels. Unlike existing ISAC systems, which ignore the polarization of electromagnetic waves, this study adopts a DP base station (BS) and a DP RIS to serve users with a pair of DP antennas. The achievable sum rate is maximized by jointly optimizing the beamforming matrix at the DP BS and the reflecting coefficients at the DP RIS. To address this problem, we first utilize the weighted minimum mean-square error (WMMSE) method to transform the objective function into a more tractable form, and then an alternating optimization (AO) method is employed to decouple the original problem into two subproblems. Due to the constant modulus constraint, the DP RIS reflection matrix optimization problem is addressed by the majorization-minimization (MM) method. For the DP beamforming matrix, we propose a penalty-based algorithm that can obtain a low-complexity closed-form solution. Simulation results validate the advantage of deploying a DP transmit array and a DP RIS in the considered ISAC systems.
中文摘要:本研究提出了一种双极化智能反射面辅助的集成传感与通信系统,通过加权最小均方误差和交替优化方法联合优化波束成形与反射系数以最大化和速率,仿真验证了该系统的性能优势。
English Summary: This study introduces a dual-polarized RIS-assisted ISAC system that jointly optimizes beamforming and reflection coefficients using WMMSE and alternating optimization methods to maximize sum rate, demonstrating performance advantages through simulations.

Authors:Lynnette Hui Xian Ng, Kathleen M. Carley
Title: What is a Social Media Bot? A Global Comparison of Bot and Human Characteristics
Abstract:
Chatter on social media is 20% bots and 80% humans. Chatter by bots and humans is consistently different: bots tend to use linguistic cues that can be easily automated while humans use cues that require dialogue understanding. Bots use words that match the identities they choose to present, while humans may send messages that are not related to the identities they present. Bots and humans differ in their communication structure: sampled bots have a star interaction structure, while sampled humans have a hierarchical structure. These conclusions are based on a large-scale analysis of social media tweets across ~200 million users across 7 events. Social media bots took the world by storm when social-cybersecurity researchers realized that social media users not only consisted of humans but also of artificial agents called bots. These bots wreak havoc online by spreading disinformation and manipulating narratives. Most research on bots is based on special-purpose definitions, mostly predicated on the event studied. This article begins by asking, "What is a bot?", and we study the underlying principles of how bots are different from humans. We develop a first-principle definition of a social media bot. With this definition as a premise, we systematically compare characteristics between bots and humans across global events, and reflect on how the software-programmed bot is an Artificial Intelligent algorithm, and its potential for evolution as technology advances. Based on our results, we provide recommendations for the use and regulation of bots. Finally, we discuss open challenges and future directions: Detect, to systematically identify these automated and potentially evolving bots; Differentiate, to evaluate the goodness of the bot in terms of their content postings and relationship interactions; Disrupt, to moderate the impact of malicious bots.
中文摘要:社交媒体中20%为机器人账号,80%为人类用户,机器人倾向于使用可自动化的语言特征并呈现星型交互结构,而人类采用需要对话理解的自然交流方式并形成层级互动结构,这一结论基于对7个事件中约2亿用户的大规模分析得出。
English Summary: Social media chatter is composed of 20% bots and 80% humans, with bots using automated linguistic patterns and structured interactions while humans employ dialogue-based communication, as revealed through large-scale analysis of 200 million users across seven events.

Authors:Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, Arman Cohan
Title: Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
Abstract:
Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation of LLMs. Furthermore, we discovered that when evaluating LLMs with similar performance, the performance of the automatic LLM bencher declines sharply, underscoring the limitations of current benchers and calling for future work. Lastly, we found that the evaluation models' performance at the instance level (e.g., the accuracy of selecting the best output) does not always align with their effectiveness when used as a component of a bencher, highlighting the importance of dedicated system-level evaluation of benchers.
中文: 本研究为自动评估大语言模型框架的组件选择提供了建议,揭示了其在区分性能相近模型时的局限性,并强调了系统级评估的必要性。
English: This study provides recommendations for selecting components in automatic LLM evaluation frameworks and reveals their limitations in distinguishing closely performing models, emphasizing the need for system-level assessment.
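
The abstract names the ELO rating system as one possible aggregation method for pairwise comparisons. A minimal sketch of that aggregation step is given below, using the standard Elo update; the function names, the K-factor of 32, the initial rating of 1000, and the model labels are illustrative assumptions, not the configuration studied in the paper.

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update for one comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    Returns the updated ratings (r_a, r_b).
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


def rank_systems(comparisons, k=32.0, initial=1000.0):
    """Aggregate (system_a, system_b, score_a) judgments into a ranking, best first."""
    ratings = {}
    for a, b, score_a in comparisons:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = elo_update(r_a, r_b, score_a, k)
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical judgments from an evaluation model.
    judgments = [("m1", "m2", 1.0), ("m2", "m3", 1.0), ("m1", "m3", 1.0)]
    print(rank_systems(judgments))
```

Elo ratings depend on the order in which comparisons are processed, which is one reason the choice of aggregation method can change the final ranking.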

Authors:Zijun Deng, Rafael Orozco, Abhinav Prakash Gahlot, Felix J. Herrmann
Title: Probabilistic Joint Recovery Method for CO$_2$ Plume Monitoring
Abstract:
Reducing CO$_2$ emissions is crucial to mitigating climate change. Carbon Capture and Storage (CCS) is one of the few technologies capable of achieving net-negative CO$_2$ emissions. However, predicting fluid flow patterns in CCS remains challenging due to uncertainties in CO$_2$ plume dynamics and reservoir properties. Building on existing seismic imaging methods like the Joint Recovery Method (JRM), which lacks uncertainty quantification, we propose the Probabilistic Joint Recovery Method (pJRM). By estimating posterior distributions across surveys using a shared generative model, pJRM provides uncertainty information to improve risk assessment in CCS projects.
中文: 概率联合恢复方法(pJRM)通过共享生成模型估计跨勘测的后验分布,为碳捕集与封存项目提供不确定性信息,从而改进现有地震成像方法在风险评估中的不足。
English: The Probabilistic Joint Recovery Method (pJRM) enhances CCS risk assessment by quantifying uncertainties in CO₂ plume dynamics through posterior distribution estimation, addressing limitations in existing seismic imaging techniques.

Authors:Shubham Aggarwal, Muhammad Aneeq uz Zaman, Melih Bastopcu, Sennur Ulukus, Tamer Başar
Title: Distributed Offloading in Multi-Access Edge Computing Systems: A Mean-Field Perspective
Abstract:
Multi-access edge computing (MEC) technology is a promising solution to assist power-constrained IoT devices by providing additional computing resources for time-sensitive tasks. In this paper, we consider the problem of optimal task offloading in MEC systems with due consideration of the timeliness and scalability issues under two scenarios of equitable and priority access to the edge server (ES). In the first scenario, we consider a MEC system consisting of $N$ devices assisted by one ES, where the devices can split task execution between a local processor and the ES, with equitable access to the ES. In the second scenario, we consider a MEC system consisting of one primary user, $N$ secondary users and one ES. The primary user has priority access to the ES while the secondary users have equitable access to the ES amongst themselves. In both scenarios, due to the power consumption associated with utilizing the local resource and task offloading, the devices must optimize their actions. Additionally, since the ES is a shared resource, other users' offloading activity serves to increase latency incurred by each user. We thus model both scenarios using a non-cooperative game framework. However, the presence of a large number of users makes it nearly impossible to compute the equilibrium offloading policies for each user, which would require a significant information exchange overhead between users. Thus, to alleviate such scalability issues, we invoke the paradigm of mean-field games to compute approximate Nash equilibrium policies for each user using their local information, and further study the trade-offs between increasing information freshness and reducing power consumption for each user. Using numerical evaluations, we show that our approach can recover the offloading trends displayed under centralized solutions, and provide additional insights into the results obtained.
中文: 本文通过平均场博弈理论为多接入边缘计算系统开发了可扩展的任务卸载框架,在公平访问和优先级访问两种场景下优化了延迟与功耗的平衡关系。
English: This paper develops a scalable task offloading framework for multi-access edge computing systems using mean-field game theory to optimize latency-power tradeoffs under both equitable and priority server access scenarios.

Authors:Dario Coscia, Max Welling, Nicola Demo, Gianluigi Rozza
Title: BARNN: A Bayesian Autoregressive and Recurrent Neural Network
Abstract:
Autoregressive and recurrent networks have achieved remarkable progress across various fields, from weather forecasting to molecular generation and Large Language Models. Despite their strong predictive capabilities, these models lack a rigorous framework for addressing uncertainty, which is key in scientific applications such as PDE solving, molecular generation and Machine Learning Force Fields. To address this shortcoming, we present BARNN: a variational Bayesian Autoregressive and Recurrent Neural Network. BARNNs aim to provide a principled way to turn any autoregressive or recurrent model into its Bayesian version. BARNN is based on the variational dropout method, which allows it to be applied to large recurrent neural networks as well. We also introduce a temporal version of the "Variational Mixtures of Posteriors" prior (tVAMP-prior) to make Bayesian inference efficient and well-calibrated. Extensive experiments on PDE modelling and molecular generation demonstrate that BARNN not only achieves comparable or superior accuracy compared to existing methods, but also excels in uncertainty quantification and modelling long-range dependencies.
中文: BARNN作为一种变分贝叶斯框架,可将自回归和循环模型转化为贝叶斯版本,在偏微分方程建模和分子生成等任务中不仅保持或超越原有精度,更显著提升了不确定性量化能力。
English: BARNN is a variational Bayesian framework that transforms autoregressive and recurrent models into their Bayesian versions, enhancing uncertainty quantification while maintaining or surpassing accuracy in tasks like PDE modeling and molecular generation.

Authors:Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri
Title: Explainable and Robust Millimeter Wave Beam Alignment for AI-Native 6G Networks
Abstract:
Integrated artificial intelligence (AI) and communication has been recognized as a key pillar of 6G and beyond networks. In line with AI-native 6G vision, explainability and robustness in AI-driven systems are critical for establishing trust and ensuring reliable performance in diverse and evolving environments. This paper addresses these challenges by developing a robust and explainable deep learning (DL)-based beam alignment engine (BAE) for millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. The proposed convolutional neural network (CNN)-based BAE utilizes received signal strength indicator (RSSI) measurements over a set of wide beams to accurately predict the best narrow beam for each UE, significantly reducing the overhead associated with exhaustive codebook-based narrow beam sweeping for initial access (IA) and data transmission. To ensure transparency and resilience, the Deep k-Nearest Neighbors (DkNN) algorithm is employed to assess the internal representations of the network via a nearest-neighbor approach, providing human-interpretable explanations and confidence metrics for detecting out-of-distribution inputs. Experimental results demonstrate that the proposed DL-based BAE is robust to measurement noise and reduces beam training overhead by 75% compared to exhaustive search while maintaining near-optimal performance in terms of spectral efficiency. Moreover, the proposed framework improves outlier detection robustness by up to 5x and offers clearer insights into beam prediction decisions compared to traditional softmax-based classifiers.
中文摘要:本文针对6G毫米波MIMO系统提出了一种鲁棒且可解释的深度学习波束对准方案,通过结合CNN和DkNN算法,在减少75%波束训练开销的同时,显著提升了系统透明度和异常检测能力。
English Summary: This paper introduces a robust and explainable deep learning-based beam alignment system for 6G mmWave MIMO networks, using CNN and DkNN algorithms to reduce beam training overhead by 75% while enhancing transparency and outlier detection.

Authors:Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo
Title: In-Context Meta LoRA Generation
Abstract:
Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE). The CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% of the storage required by the original LoRA.
中文: ICM-LoRA采用条件变分自编码器根据任务描述生成针对性LoRA参数,通过元学习捕捉任务关联性,在实现多任务高效适配的同时将存储占用降至原始方法的1%。
English: ICM-LoRA uses a conditional variational autoencoder to generate task-specific LoRA parameters from task descriptions, enabling multi-task adaptation at about 1% of the storage of the original LoRA (283MB) while capturing inter-task relationships through in-context meta-learning.

Authors:Xiaobei Wang, Shuchang Liu, Qingpeng Cai, Xiang Li, Lantao Hu, Han li, Guangming Xie
Title: Value Function Decomposition in Markov Recommendation Process
Abstract:
Recent advances in recommender systems have shown that user-system interaction essentially formulates long-term optimization problems, and online reinforcement learning can be adopted to improve recommendation performance. The general solution framework incorporates a value function that estimates the user's expected cumulative rewards in the future and guides the training of the recommendation policy. To avoid local maxima, the policy may explore potential high-quality actions during inference to increase the chance of finding better future rewards. To accommodate the stepwise recommendation process, one widely adopted approach to learning the value function is learning from the difference between the values of two consecutive states of a user. However, we argue that this paradigm involves a challenge of Mixing Random Factors: there exist two random factors from the stochastic policy and the uncertain user environment, but they are not separately modeled in the standard temporal difference (TD) learning, which may result in a suboptimal estimation of the long-term rewards and less effective action exploration. As a solution, we show that these two factors can be separately approximated by decomposing the original temporal difference loss. The disentangled learning framework can achieve a more accurate estimation with faster learning and improved robustness against action exploration. As an empirical verification of our proposed method, we conduct offline experiments with simulated online environments built on the basis of public datasets.
中文摘要:近期研究提出一种新型强化学习框架,通过解耦时序差分学习中的策略不确定性和用户环境不确定性,实现了更精准的长期收益预估,从而提升推荐系统性能。
English Summary: Recent research proposes a novel reinforcement learning framework for recommender systems that disentangles policy and environmental uncertainties in temporal difference learning, leading to more accurate long-term reward estimation and improved recommendation performance.

Authors:Zeeshan Rasheed, Muhammad Waseem, Kai Kristian Kemell, Aakash Ahmad, Malik Abdul Sami, Jussi Rasku, Kari Systä, Pekka Abrahamsson
Title: Large Language Models for Code Generation: The Practitioners Perspective
Abstract:
Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry-based projects are developing various tools, benchmarks, and metrics to evaluate the effectiveness of LLM-generated code. However, there is a lack of solutions evaluated through empirically grounded methods that incorporate practitioners' perspectives to assess functionality, syntax, and accuracy in real-world applications. To address this gap, we propose and develop a multi-model unified platform to generate and execute code based on natural language prompts. We conducted a survey with 60 software practitioners from 11 countries across four continents working in diverse professional roles and domains to evaluate the usability, performance, strengths, and limitations of each model. The results present practitioners' feedback and insights into the use of LLMs in software development, including their strengths and weaknesses, key aspects overlooked by benchmarks and metrics, and a broader understanding of their practical applicability. These findings can help researchers and practitioners make informed decisions for systematically selecting and using LLMs in software development projects. Future research will focus on integrating more diverse models into the proposed system, incorporating additional case studies, and conducting developer interviews for deeper empirical insights into LLM-driven software development.
中文:大型语言模型正日益成为编程助手,但现有评估方法缺乏实证基础和从业者视角,为此开发了一个统一平台来生成和执行代码,并收集了来自四大洲11个国家60名软件从业者的反馈,以评估其可用性、性能及实际应用价值。
English: Large language models are increasingly used as coding assistants, but current evaluation methods lack empirical grounding and practitioner input, prompting the development of a unified platform for code generation and execution that incorporates feedback from 60 global software professionals to assess usability, performance, and practical applicability.

Authors:Yunbo Long, Liming Xu, Stefan Schoepf, Alexandra Brintrup
Title: Random Walk Guided Hyperbolic Graph Distillation
Abstract:
Graph distillation (GD) is an effective approach to extract useful information from large-scale network structures. However, existing methods, which operate in Euclidean space to generate condensed graphs, struggle to capture the inherent tree-like geometry of real-world networks, resulting in distilled graphs with limited task-specific information for downstream tasks. Furthermore, these methods often fail to extract dynamic properties from graphs, which are crucial for understanding information flow and facilitating graph continual learning. This paper presents Hyperbolic Graph Distillation with Random Walks Optimization (HyDRO), a novel graph distillation approach that leverages hyperbolic embeddings to capture complex geometric patterns and optimize the spectral gap in hyperbolic space. Experiments show that HyDRO demonstrates strong task generalization, consistently outperforming state-of-the-art methods in both node classification and link prediction tasks. HyDRO also effectively preserves graph random walk properties, producing condensed graphs that achieve enhanced performance in continual graph learning. Additionally, HyDRO achieves competitive results on mainstream graph distillation benchmarks, while maintaining a strong balance between privacy and utility, and exhibiting robust resistance to noise.
中文: HyDRO提出了一种双曲图蒸馏方法,通过捕捉复杂几何特征和优化谱特性,在节点分类、链接预测和持续学习任务中显著优于现有技术,同时确保了隐私保护和抗噪能力。
English: HyDRO introduces a hyperbolic graph distillation method that captures complex geometric patterns and optimizes the spectral gap in hyperbolic space, consistently outperforming state-of-the-art methods in node classification and link prediction, improving continual graph learning, and balancing privacy and utility while remaining robust to noise.

Authors:Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, Jianhua Xu, Zenan Zhou, Weipeng Chen
Title: Ocean-OCR: Towards General OCR Application via a Vision-Language Model
Abstract:
Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ Native Resolution ViT to enable variable resolution input and utilize a substantial collection of high-quality OCR datasets to enhance the model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.
中文: Ocean-OCR作为首个超越专业OCR模型的多模态大语言模型,在各类OCR场景中展现出卓越性能,同时保持优秀的通用任务理解能力。
English: Ocean-OCR is a 3B multimodal large language model that achieves state-of-the-art performance in various OCR scenarios, surpassing professional OCR models while maintaining strong general task understanding capabilities.

Authors:Jiajie Li, Brian R Quaranto, Chenhui Xu, Ishan Mishra, Ruiyang Qin, Dancheng Liu, Peter C W Kim, Jinjun Xiong
Title: Recognize Any Surgical Object: Unleashing the Power of Weakly-Supervised Data
Abstract:
We present RASO, a foundation model designed to Recognize Any Surgical Object, offering robust open-set recognition capabilities across a broad range of surgical procedures and object classes, in both surgical images and videos. RASO leverages a novel weakly-supervised learning framework that generates tag-image-text pairs automatically from large-scale unannotated surgical lecture videos, significantly reducing the need for manual annotations. Our scalable data generation pipeline gathers 2,200 surgical procedures and produces 3.6 million tag annotations across 2,066 unique surgical tags. Our experiments show that RASO achieves improvements of 2.9 mAP, 4.5 mAP, 10.6 mAP, and 7.2 mAP on four standard surgical benchmarks, respectively, in zero-shot settings, and surpasses state-of-the-art models in supervised surgical action recognition tasks. Code, model, and demo are available at https://ntlm1686.github.io/raso.
Chinese: RASO是一种基础模型,通过弱监督学习框架在手术图像和视频中稳健识别各类手术对象,在多个标准手术基准的零样本和监督任务中均实现了显著性能提升。
English: RASO is a foundation model that recognizes any surgical object in images and videos using a weakly-supervised learning framework, improving zero-shot performance by 2.9 to 10.6 mAP on four standard surgical benchmarks and surpassing state-of-the-art models in supervised surgical action recognition.

Authors:Yeyue Cai, Meixia Tao, Jianhua Mo, Shu Sun
Title: Hybrid Near/Far-Field Frequency-Dependent Beamforming via Joint Phase-Time Arrays
Abstract:
Joint phase-time arrays (JPTA) emerge as a cost-effective and energy-efficient architecture for frequency-dependent beamforming in wideband communications by utilizing both true-time delay units and phase shifters. This paper exploits the potential of JPTA to simultaneously serve multiple users in both near- and far-field regions with a single radio frequency chain. The goal is to jointly optimize JPTA-based beamforming and subband allocation to maximize overall system performance. To this end, we formulate a system utility maximization problem, including sum-rate maximization and proportional fairness as special cases. We develop a 3-step alternating optimization (AO) algorithm and an efficient deep learning (DL) method for this problem. The DL approach includes a 2-layer convolutional neural network, a 3-layer graph attention network (GAT), and a normalization module for resource and beamforming optimization. The GAT efficiently captures the interactions between resource allocation and analog beamformers. Simulation results confirm that JPTA outperforms conventional phased arrays (PA) in enhancing user rate and strikes a good balance between PA and fully-digital approach in energy efficiency. Employing a logarithmic utility function for user rates ensures greater fairness than maximizing sum-rates. Furthermore, the DL network achieves comparable performance to the AO approach, while having orders of magnitude lower computational complexity.
中文: 联合相位时间阵列(JPTA)通过单射频链实现高效宽带多用户波束成形,采用交替优化和深度学习方法优化,在速率和公平性上超越传统相控阵,同时显著降低计算复杂度。
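English: Joint phase-time arrays (JPTA) serve multiple near- and far-field users with a single RF chain; beamforming and subband allocation are jointly optimized via alternating optimization or a deep learning method, yielding higher user rates than conventional phased arrays, with the deep learning method matching alternating optimization at orders of magnitude lower computational complexity.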

Authors:Yuntong Hu, Zhihan Lei, Zhongjie Dai, Allen Zhang, Abhinav Angirekula, Zheng Zhang, Liang Zhao
Title: CG-RAG: Research Question Answering by Citation Graph Retrieval-Augmented LLMs
Abstract:
Research question answering requires accurate retrieval and contextual understanding of scientific literature. However, current Retrieval-Augmented Generation (RAG) methods often struggle to balance complex document relationships with precise information retrieval. In this paper, we introduce Contextualized Graph Retrieval-Augmented Generation (CG-RAG), a novel framework that integrates sparse and dense retrieval signals within graph structures to enhance retrieval efficiency and subsequently improve generation quality for research question answering. First, we propose a contextual graph representation for citation graphs, effectively capturing both explicit and implicit connections within and across documents. Next, we introduce Lexical-Semantic Graph Retrieval (LeSeGR), which seamlessly integrates sparse and dense retrieval signals with graph encoding. It bridges the gap between lexical precision and semantic understanding in citation graph retrieval, demonstrating generalizability to existing graph retrieval and hybrid retrieval methods. Finally, we present a context-aware generation strategy that utilizes the retrieved graph-structured information to generate precise and contextually enriched responses using large language models (LLMs). Extensive experiments on research question answering benchmarks across multiple domains demonstrate that our CG-RAG framework significantly outperforms RAG methods combined with various state-of-the-art retrieval approaches, delivering superior retrieval accuracy and generation quality.
中文: 本文提出的CG-RAG框架通过在图结构中融合稀疏与稠密检索信号,有效提升了科研问答的检索精度和生成质量,显著优于现有最先进方法。
English: This paper introduces CG-RAG, a novel framework that enhances research question answering by integrating sparse and dense retrieval signals within graph structures, significantly outperforming existing methods in both retrieval accuracy and generation quality.

Authors:Chongzhi Zhang, Junhao Zheng, Zhiping Peng, Qianli Ma
Title: Neural-Symbolic Message Passing with Dynamic Pruning
Abstract:
Complex Query Answering (CQA) over incomplete Knowledge Graphs (KGs) is a challenging task. Recently, a line of message-passing-based research has been proposed to solve CQA. However, they perform unsatisfactorily on negative queries and fail to address the noisy messages between variable nodes in the query graph. Moreover, they offer little interpretability and require complex query data and resource-intensive training. In this paper, we propose a Neural-Symbolic Message Passing (NSMP) framework based on pre-trained neural link predictors. By introducing symbolic reasoning and fuzzy logic, NSMP can generalize to arbitrary existential first order logic queries without requiring training while providing interpretable answers. Furthermore, we introduce a dynamic pruning strategy to filter out noisy messages between variable nodes. Experimental results show that NSMP achieves a strong performance. Additionally, through complexity analysis and empirical verification, we demonstrate the superiority of NSMP in inference time over the current state-of-the-art neural-symbolic method. Compared to this approach, NSMP demonstrates faster inference times across all query types on benchmark datasets, with speedup ranging from 2$\times$ to over 150$\times$.
中文: 本文提出的神经符号消息传递(NSMP)框架通过引入符号推理和动态剪枝策略,解决了复杂查询应答中的噪声传递和负查询处理问题,在无需训练的情况下实现了优越性能,且推理速度比现有最优方法快2至150倍。
English: The proposed Neural-Symbolic Message Passing (NSMP) framework addresses limitations in complex query answering over incomplete knowledge graphs by incorporating symbolic reasoning and dynamic pruning, achieving strong performance without requiring training and inference 2x to over 150x faster than the current state-of-the-art neural-symbolic method.

Authors:Peng Xue, Wei Fang, Zhengyu Ma, Zihan Huang, Zhaokun Zhou, Yonghong Tian, Timothée Masquelier, Huihui Zhou
Title: Channel-wise Parallelizable Spiking Neuron with Multiplication-free Dynamics and Large Temporal Receptive Fields
Abstract:
Spiking Neural Networks (SNNs) are distinguished from Artificial Neural Networks (ANNs) by their sophisticated neuronal dynamics and sparse binary activations (spikes) inspired by the biological neural system. Traditional neuron models use iterative step-by-step dynamics, resulting in serial computation and slow training speed of SNNs. Recently, parallelizable spiking neuron models have been proposed to fully utilize the massive parallel computing ability of graphics processing units to accelerate the training of SNNs. However, existing parallelizable spiking neuron models involve dense floating-point operations and can only learn long-term dependencies well with a large order, at huge computational and memory cost. To resolve this trade-off between performance and cost, we propose the mul-free channel-wise Parallel Spiking Neuron, which is hardware-friendly and suitable for SNNs' resource-restricted application scenarios. The proposed neuron incorporates channel-wise convolution to enhance the learning ability, introduces sawtooth dilations to reduce the neuron order, and employs the bit shift operation to avoid multiplications. The design and implementation of the acceleration methods are discussed in detail. Our methods are validated on the neuromorphic Spiking Heidelberg Digits speech, sequential CIFAR image, and neuromorphic DVS-Lip vision datasets, achieving the best accuracy among SNNs. Training speed results demonstrate the effectiveness of our acceleration methods, providing a practical reference for future research.
中文: 提出的无乘法通道并行脉冲神经元通过硬件友好的位运算降低计算成本,同时在多个数据集上实现最高精度和加速训练,有效平衡性能与资源消耗。
English: The proposed mul-free channel-wise Parallel Spiking Neuron enhances learning ability while reducing computational costs through hardware-friendly operations like bit shifting, achieving top accuracy and accelerated training across multiple datasets.

Authors:Jonathan Will, Nico Treide, Lauritz Thamsen, Odej Kao
Title: Experimentally Evaluating the Resource Efficiency of Big Data Autoscaling
Abstract:
Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual resource allocations, such as memory and CPU cores, must meet the specific resource requirements of the job. An alternative to selecting a static resource allocation for a job execution is autoscaling as implemented for example by Spark. In this paper, we evaluate the resource efficiency of autoscaling batch data processing jobs based on resource demand both conceptually and experimentally by analyzing a new dataset of Spark job executions on Google Dataproc Serverless. In our experimental evaluation, we show that there is no significant resource efficiency gain over static resource allocations. We found that the inherent conceptual limitations of such autoscaling approaches are the inelasticity of node size as well as the inelasticity of the ratio of memory to CPU cores.
中文摘要:分布式数据流系统(如Spark)中的自动扩缩容由于节点规模和内存与CPU核心比例缺乏弹性,相较于静态资源配置并未带来显著的资源效率提升。
English Summary: Autoscaling of batch jobs in distributed dataflow systems, evaluated on Spark job executions on Google Dataproc Serverless, shows no significant resource efficiency gain over static allocations, because node size and the ratio of memory to CPU cores are inelastic.